The modern science of networks has brought significant advances to our understanding of complex 
systems. One of the most relevant features of graphs representing real systems is community 
structure, or clustering, i.e. the organization of vertices in clusters, with many edges joining 
vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such 
clusters, or communities, can be considered as fairly independent compartments of a graph, playing 
a role similar to that of, e.g., the tissues or the organs in the human body. Detecting communities 
is of great importance in sociology, biology and computer science, disciplines where systems are 
often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite 
the huge effort of a large interdisciplinary community of scientists working on it over the past few 
years. We will attempt a thorough exposition of the topic, from the definition of the main elements 
of the problem, to the presentation of most methods developed, with a special focus on techniques 
designed by statistical physicists, from the discussion of crucial issues like the significance of 
clustering and how methods should be tested and compared against each other, to the description 
of applications to real networks. 

I. INTRODUCTION 

The origin of graph theory dates back to Euler's solution 
of the puzzle of Königsberg's bridges in 1736 (Euler, 
1736). Since then a lot has been learned about graphs 
and their mathematical properties (Bollobás, 1998). In 
the 20th century they have also become extremely useful 
as representations of a wide variety of systems in different 
areas. Biological, social, technological, and information 
networks can be studied as graphs, and graph analysis 
has become crucial to understand the features of these 
systems. For instance, social network analysis started in 
the 1930's and has become one of the most important 
topics in sociology (Scott, 2000; Wasserman and Faust, 
1994). In recent times, the computer revolution has pro- 
vided scholars with a huge amount of data and computa- 
tional resources to process and analyze these data. The 
size of real networks one can potentially handle has also 
grown considerably, reaching millions or even billions of 
vertices. The need to deal with such a large number of 
units has produced a deep change in the way graphs are 
approached (Albert and Barabási, 2002; Barrat et al., 
2008; Boccaletti et al., 2006; Mendes and Dorogovtsev, 
2003; Newman, 2003; Pastor-Satorras and Vespignani, 
2004). 

Graphs representing real systems are not regular like, 
e. g., lattices. They are objects where order coexists with 
disorder. The paradigm of a disordered graph is the random 
graph, introduced by P. Erdős and A. Rényi (Erdős 
and Rényi, 1959). In it, the probability of having an 
edge between a pair of vertices is equal for all possible 
pairs (see Appendix). In a random graph, the distribu- 
tion of edges among the vertices is highly homogeneous. 
For instance, the distribution of the number of neigh- 
bours of a vertex, or degree, is binomial, so most ver- 
tices have equal or similar degree. Real networks are 
not random graphs, as they display big inhomogeneities, 
revealing a high level of order and organization. The de- 
gree distribution is broad, with a tail that often follows 
a power law: therefore, many vertices with low degree 
coexist with some vertices with large degree. Further- 
more, the distribution of edges is not only globally, but 
also locally inhomogeneous, with high concentrations of 
edges within special groups of vertices, and low concen- 
trations between these groups. This feature of real net- 
works is called community structure (Girvan and New- 
man, 2002), or clustering, and is the topic of this review 
(for earlier reviews see Refs. (Danon et al., 2007; Fortunato 
and Castellano, 2009; Newman, 2004a; Porter et al., 
2009; Schaeffer, 2007)). Communities, also called clusters 
or modules, are groups of vertices which probably share 
common properties and/or play similar roles within the 
graph. In Fig. 1 a schematic example of a graph with 
communities is shown. 
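
The homogeneity of the random graph described above can be checked numerically. The following is a minimal sketch, assuming a plain edge-list representation; the function names `random_graph` and `degree_sequence`, the seed and the values of n and p are illustrative choices of ours, not taken from the literature cited here. It joins every pair of vertices with the same probability p and prints the mean and the extreme values of the degree, which one expects to lie close to p(n-1).

```python
import random
from collections import Counter


def random_graph(n, p, seed=0):
    """Erdos-Renyi random graph: each of the n(n-1)/2 vertex pairs
    is joined by an edge independently with the same probability p."""
    rng = random.Random(seed)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if rng.random() < p]


def degree_sequence(n, edges):
    """Number of neighbours (degree) of every vertex."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


# Illustrative parameters (assumptions of this sketch).
n, p = 500, 0.02
edges = random_graph(n, p)
deg = degree_sequence(n, edges)
histogram = Counter(deg)  # degree -> number of vertices with that degree

print("mean degree:", sum(deg) / n, "expected:", p * (n - 1))
print("min / max degree:", min(deg), max(deg))
```

In a run of this kind the degrees stay within a narrow band around the mean, in contrast with the broad, heavy-tailed degree distributions of real networks discussed in the text.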

FIG. 1 A simple graph with three communities, enclosed 
by the dashed circles. Reprinted figure with permission from 
Ref. (Fortunato and Castellano, 2009). ©2009 by Springer. 

Society offers a wide variety of possible group organizations: 
families, working and friendship circles, villages, 
towns, nations. The spread of the Internet has also led 
to the creation of virtual groups that live on the Web, 
like online communities. Indeed, social communities have 
been studied for a long time (Coleman, 1964; Freeman, 
2004; Kottak, 2004; Moody and White, 2003). Communi- 
ties also occur in many networked systems from biology, 
computer science, engineering, economics, politics, etc. 
In protein-protein interaction networks, communities are 
likely to group proteins having the same specific function 
within the cell (Chen and Yuan, 2006; Rives and Galitski, 
2003; Spirin and Mirny, 2003), in the graph of the World 
Wide Web they may correspond to groups of pages deal- 
ing with the same or related topics (Dourisboure et al., 
2007; Flake et al., 2002), in metabolic networks they may 
be related to functional modules such as cycles and pathways 
(Guimerà and Amaral, 2005; Palla et al., 2005), 
in food webs they may identify compartments (Krause 
et al., 2003; Pimm, 1979), and so on. 

Communities can have concrete applications. Clustering 
Web clients who have similar interests and are 
geographically near to each other may improve the 
performance of services provided on the World Wide Web, in 
that each cluster of clients could be served by a dedicated 
mirror server (Krishnamurthy and Wang, 2000). 
Identifying clusters of customers with similar interests 
in the network of purchase relationships between customers 
and products of online retailers (like, e.g., 
www.amazon.com) makes it possible to set up efficient 
recommendation systems (Reddy et al., 2002), that better guide 
customers through the list of items of the retailer and 
enhance the business opportunities. Clusters of large 
graphs can be used to create data structures in order 
to efficiently store the graph data and to handle navigational 
queries, like path searches (Agrawal and Jagadish, 
1994; Wu et al., 2004). Ad hoc networks (Perkins, 2001), 
i. e. self-configuring networks formed by communication 
nodes acting in the same region and rapidly changing 
(because the devices move, for instance), usually have 
no centrally maintained routing tables that specify how 
nodes have to communicate with other nodes. Grouping the 
nodes into clusters enables one to generate compact rout- 
ing tables while the choice of the communication paths 
is still efficient (Steenstrup, 2001). 

Community detection is important for other reasons, 
too. Identifying modules and their boundaries allows for 
a classification of vertices, according to their structural 
position in the modules. So, vertices with a central posi- 
tion in their clusters, i. e. sharing a large number of edges 
with the other group partners, may have an important 
function of control and stability within the group; vertices 
lying at the boundaries between modules play an important 
role of mediation and lead the relationships and 
exchanges between different communities (similar to 
Csermely's "creative elements" (Csermely, 2008)). Such 
classification seems to be meaningful in social (Burt, 1976; 
Freeman, 1977; Granovetter, 1973) and metabolic networks 
(Guimerà and Amaral, 2005). Finally, one can 
study the graph where vertices are the communities and 
edges are set between clusters if there are connections be- 
tween some of their vertices in the original graph and/or 
if the modules overlap. In this way one attains a coarse-grained 
description of the original graph, which unveils 
the relationships between modules¹. Recent studies indicate 
that networks of communities have a different degree 
distribution with respect to the full graphs (Palla et al., 
2005); however, the origin of their structures can be 
explained by the same mechanism (Pollner et al., 2006). 

Another important aspect related to community struc- 
ture is the hierarchical organization displayed by most 
networked systems in the real world. Real networks are 
usually composed of communities including smaller 
communities, which in turn include smaller communities, etc. 
The human body offers a paradigmatic example of 
hierarchical organization: it is composed of organs, organs 
are composed of tissues, tissues of cells, etc. Another 
example is represented by business firms, which are 
characterized by a pyramidal organization, going from the 
workers to the president, with intermediate levels corre- 
sponding to work groups, departments and management. 
Herbert A. Simon has emphasized the crucial role played 
by hierarchy in the structure and evolution of complex 
systems (Simon, 1962). The generation and evolution of 
a system organized in interrelated stable subsystems are 
much quicker than if the system were unstructured, 
because it is much easier to assemble the smallest subparts 
first and use them as building blocks to get larger 
structures, until the whole system is assembled. In this way 
it is also far less likely that errors (mutations) occur 
along the process. 

¹ Coarse-graining a graph generally means mapping it onto a 
smaller graph having similar properties, which is easier to handle. 
For this purpose, the vertices of the original graph are not 
necessarily grouped in communities. Gfeller and De Los Rios have 
proposed coarse-graining schemes that keep the properties of 
dynamic processes acting on the graph, like random walks (Gfeller 
and De Los Rios, 2007) and synchronization (Gfeller and De Los 
Rios, 2008). 

The aim of community detection in graphs is to iden- 
tify the modules and, possibly, their hierarchical orga- 
nization, by only using the information encoded in the 
graph topology. The problem has a long tradition and it 
has appeared in various forms in several disciplines. The 
first analysis of community structure was carried out by 
Weiss and Jacobson (Weiss and Jacobson, 1955), who 
searched for work groups within a government agency. 
The authors studied the matrix of working relationships 
between members of the agency, which were identified by 
means of private interviews. Work groups were separated 
by removing the members working with people of differ- 
ent groups, which act as connectors between them. This 
idea of cutting the bridges between groups is at the ba- 
sis of several modern algorithms of community detection 
(Section V). Research on communities actually started 
even earlier than the paper by Weiss and Jacobson. Al- 
ready in 1927, Stuart Rice looked for clusters of people 
in small political bodies, based on the similarity of their 
voting patterns (Rice, 1927). Two decades later, George 
Homans showed that social groups could be revealed by 
suitably rearranging the rows and the columns of matri- 
ces describing social ties, until they take an approximate 
block-diagonal form (Homans, 1950). This procedure is 
now standard. Meanwhile, traditional techniques to find 
communities in social networks are hierarchical cluster- 
ing and partitional clustering (Sections IV. B and IV. C), 
where vertices are joined into groups according to their 
mutual similarity. 

Identifying graph communities is a popular topic in 
computer science, too. In parallel computing, for in- 
stance, it is crucial to know what is the best way to 
allocate tasks to processors so as to minimize the commu- 
nications between them and enable a rapid performance 
of the calculation. This can be accomplished by splitting 
the computer cluster into groups with roughly the same 
number of processors, such that the number of physi- 
cal connections between processors of different groups is 
minimal. The mathematical formalization of this prob- 
lem is called graph partitioning (Section IV. A). The first 
algorithms for graph partitioning were proposed in the 
early 1970's. 

In a seminal paper that appeared in 2002, Girvan and Newman 
proposed a new algorithm, aiming at the identification 
of edges lying between communities and their successive 
removal, a procedure that after some iterations 
leads to the isolation of the communities (Girvan and 
Newman, 2002). The intercommunity edges are detected 
according to the values of a centrality measure, the edge 
betweenness, that expresses the importance of the role 
of the edges in processes where signals are transmitted 
across the graph following paths of minimal length. The 
paper triggered intense activity in the field, and many new 
methods have been proposed in the last few years. In particular, 
physicists entered the game, bringing in their tools 
and techniques: spin models, optimization, percolation, 
random walks, synchronization, etc., became ingredients 
of new original algorithms. The field has also taken ad- 
vantage of concepts and methods from computer science, 
nonlinear dynamics, sociology, discrete mathematics. 
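
The divisive idea of Girvan and Newman summarized above can be sketched in a few lines. The code below is only an illustration under our own assumptions, not the authors' implementation: the graph is an unweighted, undirected dictionary of neighbour sets without self-loops, edge betweenness is accumulated with a Brandes-style breadth-first search, and the helper names (`edge_betweenness`, `components`, `split_once`) are hypothetical. The edge with the largest betweenness is removed, the betweenness is recomputed, and the procedure stops as soon as the number of connected components grows by one.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness of an unweighted, undirected graph.
    Every vertex pair is visited from both endpoints, so values are twice
    the usual ones; this does not affect the ranking of the edges."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):       # farthest vertices first
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return bet


def components(adj):
    """Connected components, as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def split_once(adj):
    """Remove highest-betweenness edges, recomputing the betweenness after
    each removal, until the graph breaks into one more component."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    target = len(components(adj)) + 1
    while len(components(adj)) < target:
        bet = edge_betweenness(adj)
        if not bet:                                # no edges left
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph (an assumption of this sketch): two triangles joined by one edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj = {v: set() for v in range(6)}
for a, b in toy_edges:
    toy_adj[a].add(b)
    toy_adj[b].add(a)

toy_bet = edge_betweenness(toy_adj)
toy_communities = split_once(toy_adj)
print(sorted(sorted(c) for c in toy_communities))
```

On this toy graph the single edge joining the two triangles carries all shortest paths between them, so it is the first to be removed. Applying `split_once` repeatedly to the resulting components would yield a hierarchy of nested groups; the quadratic-or-worse cost of recomputing the betweenness makes this sketch suitable for small graphs only.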

In this manuscript we try to cover in some detail the 
work done in this area. We shall pay special attention 
to the contributions made by physicists, but we shall 
also give proper credit to important results obtained by 
scholars of other disciplines. Section II introduces com- 
munities in real networks, and is supposed to make the 
reader acquainted with the problem and its relevance. In 
Section III we define the basic elements of community 
detection, i. e. the concepts of community and parti- 
tion. Traditional clustering methods in computer and 
social sciences, i. e. graph partitioning, hierarchical, 
partitional and spectral clustering are reviewed in Sec- 
tion IV. Modern methods, divided into categories based 
on the type of approach, are presented in Sections V 
to X. Algorithms to find overlapping communities, mul- 
tiresolution and hierarchical techniques, are separately 
described in Sections XI and XII, respectively, whereas 
Section XIII is devoted to the detection of communities 
evolving in time. We stress that our categorization of the 
algorithms is not sharp, because many algorithms may 
fall into more than one category: we tried to classify them based 
on what we believe is their main feature/purpose, even 
if other aspects may be present. Sections XIV and XV 
are devoted to the issues of defining when community 
structure is significant, and deciding about the quality of 
algorithms' performances. In Sections XVI and XVII we 
describe general properties of clusters found in real net- 
works, and specific applications of clustering algorithms. 
Section XVIII contains the summary of the review, along 
with a discussion about future research directions in this 
area. The review makes use of several concepts of graph 
theory, that are defined and explained in the Appendix. 
Readers not acquainted with these concepts are urged to 
read the Appendix first. 



II. COMMUNITIES IN REAL-WORLD NETWORKS 

In this section we shall present some striking examples 
of real networks with community structure. In this way 
we shall see what communities look like and why they 
are important. 

Social networks are paradigmatic examples of graphs 
with communities. The word community itself refers to 
a social context. People naturally tend to form groups, 
within their work environment, family, friends. 

In Fig. 2 we show some examples of social networks. 
The first example (Fig. 2a) is Zachary's network of karate 
club members (Zachary, 1977), a well-known graph regularly 
used as a benchmark to test community detection 
algorithms (Section XV. A). It consists of 34 vertices, the 
members of a karate club in the United States, who were 
observed during a period of three years. Edges connect 
individuals who were observed to interact outside the ac- 
tivities of the club. At some point, a conflict between 
the club president and the instructor led to the fission of 
the club into two separate groups, supporting the instructor 
and the president, respectively (indicated by squares 
and circles). The question is whether from the original 
network structure it is possible to infer the composition 
of the two groups. Indeed, by looking at Fig. 2a one 
can distinguish two aggregations, one around vertices 33 
and 34 (34 is the president), the other around vertex 1 
(the instructor). One can also identify several vertices 
lying between the two main structures, like 3, 9, 10; such 
vertices are often misclassified by community detection 
methods. 

Fig. 2b displays the largest connected component of 
a network of collaborations of scientists working at the 
Santa Fe Institute (SFI). There are 118 vertices, repre- 
senting resident scientists at SFI and their collaborators. 
Edges are placed between scientists that have published 
at least one paper together. The visualization layout allows 
one to distinguish disciplinary groups. In this network 
one observes many cliques, as authors of the same pa- 
per are all linked to each other. There are but a few 
connections between most groups. 

In Fig. 2c we show the network of bottlenose dol- 
phins living in Doubtful Sound (New Zealand) analyzed 
by Lusseau (Lusseau, 2003). There are 62 dolphins and 
edges were set between animals that were seen together 
more often than expected by chance. The dolphins separated 
into two groups after a dolphin left the place for 
some time (squares and circles in the figure). Such groups 
are quite cohesive, with several internal cliques, and eas- 
ily identifiable: only six edges join vertices of differ- 
ent groups. Due to this natural classification Lusseau's 
dolphins' network, like Zachary's karate club, is often 
used to test algorithms for community detection (Sec- 
tion XV. A). 

Protein-protein interaction (PPI) networks are the subject 
of intense investigation in biology and bioinformatics, 
as the interactions between proteins are fundamental for 
each process in the cell (Zhang, 2009). Fig. 3 illustrates 
a PPI network of the rat proteome (Jonsson et al., 2006). 
Each interaction is derived by homology from experimen- 
tally observed interactions in other organisms. In our 
example, the proteins interact very frequently with each 
other, as they belong to metastatic cells, which have a 
high motility and invasiveness with respect to normal 
cells. Communities correspond to functional groups, i. e. 
to proteins having the same or similar functions, which 
are expected to be involved in the same processes. The 
modules are labeled by the overall function or the dominating 
protein class. Most communities are associated 
with cancer and metastasis, which indirectly shows how 
important detecting modules in PPI networks is. 

FIG. 2 Community structure in social networks. a) Zachary's karate club, a standard benchmark in community detection. The 
colors correspond to the best partition found by optimizing the modularity of Newman and Girvan (Section VI.A). Reprinted 
figure with permission from Ref. (Donetti and Muñoz, 2004). ©2004 by IOP Publishing and SISSA. b) Collaboration network 
between scientists working at the Santa Fe Institute. The colors indicate high level communities obtained by the algorithm 
of Girvan and Newman (Section V.A) and correspond quite closely to research divisions of the institute. Further subdivisions 
correspond to smaller research groups, revolving around project leaders. Reprinted figure with permission from Ref. (Girvan 
and Newman, 2002). ©2002 by the National Academy of Sciences of the USA. c) Lusseau's network of bottlenose dolphins. 
The colors label the communities identified through the optimization of a modified version of the modularity of Newman and 
Girvan, proposed by Arenas et al. (Arenas et al., 2008b) (Section XII.A). The partition matches the biological classification 
of the dolphins proposed by Lusseau. Reprinted figure with permission from Ref. (Arenas et al., 2008b). ©2008 by IOP 
Publishing. 

Relationships/interactions between elements of a sys- 
tem need not be reciprocal. In many cases they have a 
precise direction, that needs to be taken into account to 
understand the system as a whole. As an example we can 
cite predator-prey relationships in food webs. In Fig. 4 
we see another example, taken from technology. The 
system is the World Wide Web, which can be seen as a 
graph by representing web pages as vertices and the hy- 
perlinks that make users move from one page to another 
as edges (Albert et al., 1999). Hyperlinks are directed: 
if one can move from page A to page B by clicking on a 
hyperlink of A, one usually does not find on B a hyperlink 
taking back to A. In fact, very few hyperlinks (less 
than 10%) are reciprocal. Communities of the web graph 
are groups of pages having topical similarities. Detect- 
ing communities in the web graph may help to identify 
the artificial clusters created by link farms in order to 
enhance the PageRank (Brin and Page, 1998) value of 
web sites and grant them a higher Google ranking. In 
this way one could discourage this unfair practice. One 
usually assumes that the existence of a hyperlink between 
two pages implies that they are content-related, and that 
this similarity is independent of the hyperlink direction. 







FIG. 3 Community structure in protein-protein interaction networks. The graph pictures the interactions between proteins 
in cancerous cells of a rat. Communities, labeled by colors, were detected with the Clique Percolation Method by Palla et al. 
(Section XI.A). Reprinted figure with permission from Ref. (Jonsson et al., 2006). ©2006 by PubMed Central. 



Therefore it is customary to neglect the directedness of 
the hyperlinks and to consider the graph as undirected, 
for the purpose of community detection. On the other 
hand, taking properly into account the directedness of 
the edges can considerably improve the quality of the 
partition(s), as one can handle a lot of precious information 
about the system. Moreover, in some instances neglecting 
edge directedness may lead to strange results (Leicht 
and Newman, 2008; Rosvall and Bergstrom, 2008). 
Developing methods of community detection for directed 
graphs is a hard task. For instance, a directed graph is 
characterized by asymmetrical matrices (adjacency ma- 
trix, Laplacian, etc.), so spectral analysis is much more 
complex. Only a few techniques can be easily extended 
from the undirected to the directed case. Otherwise, the 
problem must be formulated from scratch. 

FIG. 4 Community structure in technological networks. 
Sample of the web graph consisting of the pages of a web 
site and their mutual hyperlinks, which are directed. Com- 
munities, indicated by the colors, were detected with the al- 
gorithm of Girvan and Newman (Section V.A), by neglecting 
the directedness of the edges. Reprinted figure with permis- 
sion from Ref. (Newman and Girvan, 2004). ©2004 by the 
American Physical Society. 



Edge directedness is not the only complication to deal 
with when facing the problem of graph clustering. In 
many real networks vertices may belong to more than 
one group. In this case one speaks of overlapping com- 
munities and uses the term cover, rather than partition, 
whose standard definition forbids multiple memberships 
of vertices. Classical examples are social networks, where 
an individual usually belongs to different circles at the 
same time, from that of work colleagues to family, sport 
associations, etc. Traditional algorithms of community 
detection assign each vertex to a single module. In so doing, 
they neglect potentially relevant information. Vertices 
belonging to more than one community are likely to play 
an important role of intermediation between different 
compartments of the graph. In Fig. 5 we show a network 
of word association derived starting from the word 
"bright". The network builds on the University of South 
Florida Free Association Norms (Nelson et al., 1998). An 
edge between words A and B indicates that some people 
associate B with the word A. The graph clearly displays 
four communities, corresponding to the categories 
Intelligence, Astronomy, Light and Colors. The word 
"bright" is related to all of them by construction. Other 
words belong to more than one category, e.g. "dark" (Colors 
and Light). Accounting for overlapping communities 
introduces a further variable, the membership of vertices 
in different communities, which enormously increases the 
number of possible covers with respect to standard 
partitions. Therefore, searching for overlapping communities 
is usually much more computationally demanding than 
detecting standard partitions. 

FIG. 5 Overlapping communities in a network of word 
association. The groups, labeled by the colors, were detected 
with the Clique Percolation Method by Palla et al. 
(Section XI.A). Reprinted figure with permission from Ref. (Palla 
et al., 2005). ©2005 by the Nature Publishing Group. 
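
The distinction between a partition and a cover can be stated operationally. The sketch below is a hedged illustration: the function name `classify_grouping` is hypothetical, and we assume, for the purpose of the example only, that every vertex must belong to at least one group for the grouping to count as either a partition or a cover. A grouping in which each vertex appears exactly once is a partition; if some vertex appears in several groups it is a cover.

```python
def classify_grouping(vertices, groups):
    """Return 'partition' if every vertex lies in exactly one group,
    'cover' if every vertex lies in at least one group and some vertex
    lies in several, and 'neither' otherwise (assumption of this sketch:
    unassigned or unknown vertices are not allowed)."""
    membership = {v: 0 for v in vertices}
    for group in groups:
        for v in group:
            if v not in membership:
                return "neither"        # group mentions an unknown vertex
            membership[v] += 1
    if any(count == 0 for count in membership.values()):
        return "neither"                # some vertex is left unassigned
    if all(count == 1 for count in membership.values()):
        return "partition"
    return "cover"


# Toy example with six abstract vertices (illustrative data only).
toy_vertices = range(6)
as_partition = classify_grouping(toy_vertices, [{0, 1, 2}, {3, 4, 5}])
as_cover = classify_grouping(toy_vertices, [{0, 1, 2, 3}, {3, 4, 5}])
incomplete = classify_grouping(toy_vertices, [{0, 1}, {3, 4, 5}])
print(as_partition, as_cover, incomplete)
```

In the second grouping vertex 3 plays the role that the word "bright" plays in Fig. 5: it is shared by more than one community.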

So far we have discussed examples of unipartite graphs. 
However, it is not uncommon to find real networks with 
different classes of vertices, and edges joining only ver- 
tices of different classes. An example is a network of 
scientists and papers, where edges join scientists and the 
papers they have authored. Here there is no edge be- 
tween any pair of scientists or papers, so the graph is 
bipartite. For a multipartite network the concept of com- 
munity does not change much with respect to the case of 
unipartite graphs, as it remains related to a large den- 
sity of edges between members of the same group, with 
the only difference that the elements of each group 
belong to different vertex classes. Multipartite graphs are 
usually reduced to unipartite projections of each vertex 
class. For instance, from the bipartite network of scientists 
and papers one can extract a network of scientists 
only, who are related by coauthorship. In this way one 
can adopt standard techniques of network analysis, in 
particular standard clustering methods, but a lot of 
information gets lost. Detecting communities in multipartite 
networks can have interesting applications in, e.g., 
marketing. Large shopping networks, in which customers are 
linked to the products they have bought, allow one to classify 
customers based on the types of product they purchase 
more often: this could be used both to organize targeted 
advertising and to give recommendations about 
future purchases (Adomavicius and Tuzhilin, 2005). The 
problem of community detection in multipartite networks 
is not trivial, and usually requires ad hoc methodologies. 
Fig. 6 illustrates the famous bipartite network of Southern 
Women studied by Davis et al. (Davis et al., 1941). 
There are 32 vertices, representing 18 women from the 
area of Natchez, Mississippi, and 14 social events. Edges 
represent the participation of the women in the events. 
From the figure one can see that the network has a clear 
community structure. 

FIG. 6 Community structure in multipartite networks. This 
bipartite graph refers to the Southern Women Event 
Participation data set. Women are represented as open symbols with 
black labels, events as filled symbols with white labels. The 
illustrated vertex partition has been obtained by maximizing 
a modified version of the modularity by Newman and Girvan, 
tailored to bipartite graphs (Barber, 2007) (Section VI.B). 
Reprinted figure with permission from Ref. (Barber, 2007). 
©2007 by the American Physical Society. 
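
The unipartite projection mentioned above, from a bipartite network of scientists and papers to a network of scientists only, can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `project_coauthorship`, the toy edge list and the choice of weighting each projected edge by the number of shared papers are ours, not a prescription from the literature.

```python
from collections import defaultdict
from itertools import combinations


def project_coauthorship(bipartite_edges):
    """Project (scientist, paper) edges onto scientist-scientist edges.
    Each projected edge is weighted by the number of papers shared
    by the two scientists (an illustrative weighting choice)."""
    authors_of = defaultdict(set)
    for scientist, paper in bipartite_edges:
        authors_of[paper].add(scientist)
    weight = defaultdict(int)
    for authors in authors_of.values():
        for a, b in combinations(sorted(authors), 2):
            weight[(a, b)] += 1
    return dict(weight)


# Toy bipartite network (hypothetical data): four scientists, three papers.
toy_bipartite = [
    ("A", "p1"), ("B", "p1"), ("C", "p1"),
    ("A", "p2"), ("B", "p2"),
    ("D", "p3"),
]
coauthorship = project_coauthorship(toy_bipartite)
print(coauthorship)
```

The example also shows how information gets lost: the projection no longer tells whether A, B and C wrote one paper together or three papers in pairs, and the single-author paper of D leaves no trace at all.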

In some of the previous examples, edges have (or can 
have) weights. For instance, the edges of the collabora- 
tion network of Fig. 2b could be weighted by the number 
of papers coauthored by pairs of scientists. Similarly, 
the edges of the word association network of Fig. 5 are 
weighted by the number of times pairs of words have been 
associated by people. Weights are precious additional in- 
formation on a graph, and should be considered in the 
analysis. In many cases methods working on unweighted 
graphs can be simply extended to the weighted case. 



III. ELEMENTS OF COMMUNITY DETECTION 

The problem of graph clustering, intuitive at first sight, 
is actually not well defined. The main elements of the 
problem themselves, i. e. the concepts of community and 
partition, are not rigorously defined, and require some 
degree of arbitrariness and/or common sense. Indeed, 
some ambiguities are hidden and there are often many 
equally legitimate ways of resolving them. Therefore, it 
is not surprising that there are plenty of recipes in the 
literature and that people do not even try to ground the 
problem on shared definitions. 

It is important to stress that the identification of struc- 
tural clusters is possible only if graphs are sparse, i. e. if 
the number of edges m is of the order of the number of 
nodes n of the graph. If m ≫ n, the distribution of edges 
among the nodes is too homogeneous for communities to 
make sense². In this case the problem turns into something 
rather different, close to data clustering (Gan et al., 
2007), which requires concepts and methods of a different 
nature. The main difference is that, while communities in 
graphs are related, explicitly or implicitly, to the concept 
of edge density (inside versus outside the community), in 
data clustering communities are sets of points which are 
"close" to each other, with respect to a measure of dis- 
tance or similarity, defined for each pair of points. Some 
classical techniques for data clustering, like hierarchical, 
partitional and spectral clustering will be discussed later 
in the review (Sections IV. B, IV. C and IV. D), as they 
are sometimes adopted for graph clustering too. Other 
standard procedures for data clustering include neural 
network clustering techniques like, e.g., self-organizing 
maps, and multi-dimensional scaling techniques like, e.g., 
singular value decomposition and principal component 
analysis (Gan et al., 2007). 
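
The notion of edge density inside versus outside a group, which the paragraph above contrasts with distance-based data clustering, can be made concrete with a short sketch. We assume here an undirected, unweighted graph given as an edge list, and a candidate group with at least two and fewer than n vertices; the function name `edge_densities` is hypothetical. The internal density is the number of edges inside the group divided by the number of vertex pairs inside it, the external density is the number of edges leaving the group divided by the number of pairs with exactly one endpoint in the group, and both are compared with the average density of the whole graph.

```python
def edge_densities(n, edges, group):
    """Internal, external and average edge density for a candidate group.
    Assumes an undirected simple graph on n vertices and 2 <= |group| < n."""
    group = set(group)
    nc = len(group)
    internal = sum(1 for u, v in edges if u in group and v in group)
    external = sum(1 for u, v in edges if (u in group) != (v in group))
    d_int = internal / (nc * (nc - 1) / 2)   # pairs inside the group
    d_ext = external / (nc * (n - nc))       # pairs across the boundary
    d_avg = len(edges) / (n * (n - 1) / 2)   # all pairs of the graph
    return d_int, d_ext, d_avg


# Toy graph (illustrative): two triangles joined by a single edge.
toy_graph = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
d_int, d_ext, d_avg = edge_densities(6, toy_graph, {0, 1, 2})
print(d_int, d_ext, d_avg)
```

For a group that deserves to be called a community one expects the internal density to be appreciably larger than the average density, and the external density to be appreciably smaller, as happens for either triangle of the toy graph.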

In this section we shall attempt an ordered exposition 
of the fundamental concepts of community detection. Af- 
ter a brief discussion of the issue of computational com- 
plexity for the algorithms, we shall review the notions of 
community and partition. 



² This is not necessarily true if graphs are weighted with a 
heterogeneous distribution of weights. In such cases communities may 
still be identified as subgraphs with a high internal density of 
weight. 

A. Computational complexity 

The massive amount of data on real networks currently 
available makes the issue of the efficiency of clustering al- 
gorithms essential. The computational complexity of an 
algorithm is the estimate of the amount of resources re- 
quired by the algorithm to perform a task. This involves 
both the number of computation steps needed and the 
number of memory units that need to be simultaneously 
allocated to run the computation. Such demands are 
usually expressed by how they scale with the size of the 
system under study. In the case of a graph, the size is typically 
indicated by the number of vertices n and/or the 
number of edges m. The computational complexity of 
an algorithm cannot always be calculated. In fact, some- 
times this is a very hard task, or even impossible. In 
these cases, it is however important to have at least an 
estimate of the worst-case complexity of the algorithm, 
which is the amount of computational resources needed 
to run the algorithm in the most unfavorable case for a 
given system size. 

The notation $O(n^{\alpha} m^{\beta})$ indicates that the computer
time grows as a power of both the number of vertices
and edges, with exponents $\alpha$ and $\beta$, respectively. Ideally,
one would like to have the lowest possible values for the
exponents, which would correspond to the lowest possible
computational demands. Samples of the Web graph,
with millions of vertices and billions of edges, cannot be
tackled by algorithms whose running time grows faster
than $O(n)$ or $O(m)$.

Algorithms with polynomial complexity form the class 
P. For some important decision and optimization prob- 
lems, there are no known polynomial algorithms. Find- 
ing solutions of such problems in the worst-case scenario 
may demand an exhaustive search, which takes a time 
growing faster than any polynomial function of the sys- 
tem size, e.g. exponentially. Problems whose solutions 
can be verified in a polynomial time span the class NP 
of non-deterministic polynomial time problems, which
includes P. A problem is NP-hard if a solution for it can be
translated into a solution for any NP problem. However,
an NP-hard problem need not be in the class NP. If it
does belong to NP it is called NP-complete. The class
of NP-complete problems has drawn special attention
in computer science, as it includes many famous problems
like the Travelling Salesman, Boolean Satisfiability
(SAT), Integer Linear Programming, etc. (Garey and Johnson,
1990; Papadimitriou, 1994). The fact that NP problems
have a solution which is verifiable in polynomial
time does not mean that NP problems have polynomial 
complexity, i. e., that they are in P. In fact, the ques- 
tion of whether NP=P is the most important open prob- 
lem in theoretical computer science. NP-hard problems 
need not be in NP (in which case they would be NP- 
complete), but they are at least as hard as NP-complete 
problems, so they are unlikely to have polynomial com- 
plexity, although a proof of that is still missing. 

Many clustering algorithms or problems related to
clustering are NP-hard. In this case, it is pointless to
use exact algorithms, which could be applied only to 
very small systems. Moreover, even if an algorithm has a 
polynomial complexity, it may still be too slow to tackle 
large systems of interest. In all such cases it is common 
to use approximation algorithms, i. e. methods that do 
not deliver an exact solution to the problem at hand, 
but only an approximate solution, with the advantage of 
a lower complexity. Approximation algorithms are often 
non-deterministic, as they deliver different solutions for 
the same problem, for different initial conditions and/or 
parameters of the algorithm. The goal of such algorithms 
is to deliver a solution which differs by a constant fac- 
tor from the optimal solution. In any case, one should 
give provable bounds on the goodness of the approxi- 
mate solution delivered by the algorithm with respect to 
the optimal solution. In many cases it is not possible 
to approximate the solution within any constant, as the 
goodness of the approximation strongly depends on the 
specific problem at study. Approximation algorithms are 
commonly used for optimization problems, in which one 
wants to find the maximum or minimum value of a given 
cost function over a large set of possible system configu- 
rations. 



B. Communities 

1. Basics 

The first problem in graph clustering is to look for a 
quantitative definition of community. No definition is 
universally accepted. As a matter of fact, the defini- 
tion often depends on the specific system at hand and/or 
application one has in mind. From intuition and the ex- 
amples of Section II we get the notion that there must 
be more edges "inside" the community than edges link- 
ing vertices of the community with the rest of the graph. 
This is the reference guideline at the basis of most com- 
munity definitions. But many alternative recipes are 
compatible with it. Moreover, in most cases, commu- 
nities are algorithmically defined, i. e. they are just the 
final product of the algorithm, without a precise a priori 
definition. 

Let us start with a subgraph $\mathcal{C}$ of a graph $\mathcal{G}$, with
$|\mathcal{C}| = n_c$ and $|\mathcal{G}| = n$ vertices, respectively. We define
the internal and external degree of vertex $v \in \mathcal{C}$, $k_v^{int}$
and $k_v^{ext}$, as the number of edges connecting $v$ to other
vertices of $\mathcal{C}$ or to the rest of the graph, respectively. If
$k_v^{ext} = 0$, the vertex has neighbors only within $\mathcal{C}$, which is
likely to be a good cluster for $v$; if $k_v^{int} = 0$, instead, the
vertex is disjoint from $\mathcal{C}$ and it should better be assigned
to a different cluster. The internal degree $k_{int}^{\mathcal{C}}$ of $\mathcal{C}$ is
the sum of the internal degrees of its vertices. Likewise,
the external degree $k_{ext}^{\mathcal{C}}$ of $\mathcal{C}$ is the sum of the external
degrees of its vertices. The total degree $k^{\mathcal{C}}$ is the sum
of the degrees of the vertices of $\mathcal{C}$. By definition,
$k^{\mathcal{C}} = k_{int}^{\mathcal{C}} + k_{ext}^{\mathcal{C}}$.

We define the intra-cluster density $\delta_{int}(\mathcal{C})$ of the
subgraph $\mathcal{C}$ as the ratio between the number of internal edges
of $\mathcal{C}$ and the number of all possible internal edges, i. e.

$$\delta_{int}(\mathcal{C}) = \frac{\#\ \text{internal edges of}\ \mathcal{C}}{n_c(n_c - 1)/2}. \tag{1}$$

Similarly, the inter-cluster density $\delta_{ext}(\mathcal{C})$ is the ratio
between the number of edges running from the vertices of
$\mathcal{C}$ to the rest of the graph and the maximum number of
inter-cluster edges possible, i. e.

$$\delta_{ext}(\mathcal{C}) = \frac{\#\ \text{inter-cluster edges of}\ \mathcal{C}}{n_c(n - n_c)}. \tag{2}$$



For $\mathcal{C}$ to be a community, we expect $\delta_{int}(\mathcal{C})$ to be
appreciably larger than the average link density $\delta(\mathcal{G})$ of
$\mathcal{G}$, which is given by the ratio between the number of
edges of $\mathcal{G}$ and the maximum number of possible edges
$n(n - 1)/2$. On the other hand, $\delta_{ext}(\mathcal{C})$ has to be much
smaller than $\delta(\mathcal{G})$. Searching for the best tradeoff
between a large $\delta_{int}(\mathcal{C})$ and a small $\delta_{ext}(\mathcal{C})$ is implicitly
or explicitly the goal of most clustering algorithms. A
simple way to do that is, e. g., maximizing the sum of
the differences $\delta_{int}(\mathcal{C}) - \delta_{ext}(\mathcal{C})$ over all clusters of the
partition^ (Mancoridis et al., 1998).
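
As an illustration of the two densities, the following minimal Python sketch evaluates them for one cluster of a small undirected graph. The function name `cluster_densities` and the toy graph (two triangles joined by a single edge) are assumptions made here for illustration; they do not come from the literature cited above.

```python
def cluster_densities(edges, cluster, n):
    """Intra- and inter-cluster density of `cluster` in an undirected graph.

    edges   : iterable of (u, v) pairs, each undirected edge listed once
    cluster : iterable with the vertices of the subgraph C (2 <= n_c < n)
    n       : total number of vertices of the graph
    """
    cluster = set(cluster)
    n_c = len(cluster)
    # Edges with both endpoints in C.
    internal = sum(1 for u, v in edges if u in cluster and v in cluster)
    # Edges with exactly one endpoint in C.
    external = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    delta_int = internal / (n_c * (n_c - 1) / 2)
    delta_ext = external / (n_c * (n - n_c))
    return delta_int, delta_ext


# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
delta_int, delta_ext = cluster_densities(edges, {0, 1, 2}, n=6)
graph_density = len(edges) / (6 * 5 / 2)
```

For the triangle {0, 1, 2} the intra-cluster density is 1 and the inter-cluster density is 1/9, to be compared with the average link density 7/15 of the whole graph.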

A required property of a community is connectedness. 
We expect that for C to be a community there must be 
a path between each pair of its vertices, running only 
through vertices of C. This feature simplifies the task 
of community detection on disconnected graphs, as in 
this case one just analyzes each connected component 
separately, unless special constraints are imposed on the 
resulting clusters. 

With these basic requirements in mind, we can now 
introduce the main definitions of community. Social 
network analysts have devised many definitions of sub- 
groups with various degrees of internal cohesion among 
vertices (Moody and White, 2003; Scott, 2000; Wasserman
and Faust, 1994). Many other definitions have been
introduced by computer scientists and physicists. We 
distinguish three classes of definitions: local, global and 
based on vertex similarity. Other definitions will be given 
in the context of the algorithms for which they were in- 
troduced. 



2. Local definitions 

Communities are parts of the graph with few ties
with the rest of the system. To some extent, they can
be considered as separate entities with their own autonomy.
So, it makes sense to evaluate them independently
of the graph as a whole. Local definitions focus on the
subgraph under study, including possibly its immediate
neighborhood, but neglecting the rest of the graph. We
start with a listing of the main definitions adopted in
social network analysis, for which we shall closely follow
the exposition of Ref. (Wasserman and Faust, 1994).
There, four types of criteria were identified: complete
mutuality, reachability, vertex degree and the comparison
of internal versus external cohesion. The corresponding
communities are mostly maximal subgraphs, which cannot
be enlarged with the addition of new vertices and
edges without losing the property which defines them.

^ In Ref. (Mancoridis et al., 1998) one actually computes the
inter-cluster density by summing the densities for each pair of clusters.
Therefore the function to maximize is not exactly
$\sum_{\mathcal{C}} [\delta_{int}(\mathcal{C}) - \delta_{ext}(\mathcal{C})]$, but essentially equivalent.

Social communities can be defined in a very strict 
sense as subgroups whose members are all "friends" to 
each other (Luce and Perry, 1949) (complete mutual- 
ity). In graph terms, this corresponds to a clique, i. e. 
a subset whose vertices are all adjacent to each other. 
In social network analysis, a clique is a maximal sub- 
graph, whereas in graph theory it is common to call 
cliques also non-maximal subgraphs. Triangles are the 
simplest cliques, and are frequent in real networks. But 
larger cliques are less frequent. Moreover, the condition 
is really too strict: a subgraph with all possible internal 
edges except one would be an extremely cohesive sub- 
group, but it would not be considered a community un- 
der this recipe. Another problem is that all vertices of 
a clique are absolutely symmetric, with no differentia- 
tion between them. In many practical examples, instead, 
we expect that within a community there is a whole
hierarchy of roles for the vertices, with core vertices
coexisting with peripheral ones. We remark that vertices
may belong to several cliques simultaneously, a property
which is at the basis of the Clique Percolation Method of
Palla et al. (Palla et al., 2005) (see Section XI. A). From
a practical point of view, finding cliques in a graph is an
NP-complete problem (Bomze et al., 1999). The
Bron-Kerbosch method (Bron and Kerbosch, 1973) runs in a
time growing exponentially with the size of the graph.

It is however possible to relax the notion of clique, 
defining subgroups which are still clique-like objects. A 
possibility is to use properties related to reachability, i. e. 
to the existence (and length) of paths between vertices. 
An n-clique is a maximal subgraph such that the distance 
of each pair of its vertices is not larger than n (Alba, 
1973; Luce, 1950). For n = 1 one recovers the definition 
of clique, as all vertices are adjacent, so each geodesic 
path between any pair of vertices has length 1. This def- 
inition, more flexible than that of clique, still has some 
limitations, deriving from the fact that the geodesic paths 
need not run on the vertices of the subgraph at study, but 
may run on vertices outside the subgraph. In this way, 
there may be two disturbing consequences. First, the
diameter of the subgraph may exceed n, even if in principle
each vertex of the subgraph is at most n steps away
from any of the others. Second, the subgraph may be
disconnected, which is not consistent with the notion of
cohesion one tries to enforce. To avoid these problems,
Mokken (Mokken, 1979) has suggested two possible
alternatives, the n-clan and the n-club. An n-clan is an
n-clique whose diameter is not larger than n, i. e. a
subgraph such that the distance between any two of its
vertices, computed over shortest paths within the subgraph,
does not exceed n. An n-club, instead, is a maximal 
subgraph of diameter n. The two definitions are quite 
close: the difference is that an n-clan is maximal under 
the constraint of being an n-clique, whereas an n-club is 
maximal under the constraint imposed by the length of 
the diameter. 

Another criterion for subgraph cohesion relies on the 
adjacency of its vertices. The idea is that a vertex must
be adjacent to some minimum number of other vertices
in the subgraph. In the literature on social network
analysis there are two complementary ways of expressing this.
A k-plex is a maximal subgraph in which each vertex is
adjacent to all other vertices of the subgraph except at
most k of them (Seidman and Foster, 1978). Similarly,
a k-core is a maximal subgraph in which each vertex is 
adjacent to at least k other vertices of the subgraph (Sei- 
dman, 1983). So, the two definitions impose conditions 
on the minimal number of absent or present edges. The 
corresponding clusters are more cohesive than n-cliques, 
just because of the existence of many internal edges. In 
any graph there is a whole hierarchy of cores of different 
order, which can be identified by means of a recent
efficient algorithm (Batagelj and Zaversnik, 2003). A k-core
is essentially the same as a p-quasi complete subgraph,
which is a subgraph such that the degree of each vertex
is larger than $p(k - 1)$, where p is a real number in [0, 1]
and k the order of the subgraph (Matsuda et al., 1999).
Determining whether a graph includes a 1/2-quasi
complete subgraph of order at least k is NP-complete.
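
The k-core condition can be made concrete with a short sketch. The code below uses plain iterative peeling (repeatedly deleting vertices with fewer than k surviving neighbors); it is a simple illustration and not the optimized algorithm of Batagelj and Zaversnik. The function name `k_core` and the toy graph are assumptions introduced here.

```python
def k_core(adj, k):
    """Vertices of the k-core of an undirected graph.

    adj : dict mapping each vertex to the set of its neighbors
    k   : minimum number of neighbors required inside the core
    """
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    # Vertices that already violate the degree condition.
    stack = [v for v in adj if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed through an earlier entry
        alive.discard(v)
        # Removing v lowers the degree of its surviving neighbors.
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
core2 = k_core(adj, 2)
core3 = k_core(adj, 3)
```

In this example the 2-core is the triangle, while the 3-core is empty. Running the function for increasing k exposes the hierarchy of cores mentioned above.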

As cohesive as a subgraph can be, it would hardly be a 
community if there is a strong cohesion as well between 
the subgraph and the rest of the graph. Therefore, it 
is important to compare the internal and external cohe- 
sion of a subgraph. In fact, this is what is usually done 
in the most recent definitions of community. The first 
recipe, however, is not recent and stems from social net- 
work analysis. An LS-set (Luccio and Sami, 1969), or 
strong community (Radicchi et al., 2004), is a subgraph 
such that the internal degree of each vertex is greater 
than its external degree. This condition is quite strict 
and can be relaxed into the so-called weak definition of 
community (Radicchi et al., 2004), for which it suffices 
that the internal degree of the subgraph exceeds its ex- 
ternal degree. An LS-set is also a weak community, while 
the converse is not generally true. Hu et al. (Hu et al.,
2008) have introduced alternative definitions of strong
and weak communities: a community is strong if the
internal degree of any vertex of the community exceeds the
number of edges that the vertex shares with any other
community; a community is weak if its total internal
degree exceeds the number of edges shared by the community
with the other communities. These definitions are
in the same spirit as the planted partition model
(Section XV). An LS-set is a strong community also in the
sense of Hu et al. Likewise, a weak community according
to Radicchi et al. is also a weak community for Hu et al.
In both cases the converse is not true, however. Another
definition focuses on the robustness of clusters to edge 
removal and uses the concept of edge connectivity. The 
edge connectivity of a pair of vertices in a graph is the 
minimal number of edges that need to be removed in or- 
der to disconnect the two vertices, i. e. such that there is 
no path between them. A lambda set is a subgraph such 
that any pair of vertices of the subgraph has a larger edge 
connectivity than any pair formed by one vertex of the 
subgraph and one outside the subgraph (Borgatti et al., 
1990). However, vertices of a lambda-set need not be 
adjacent and may be quite distant from each other. 
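
The strong and weak definitions of Radicchi et al. translate directly into degree comparisons. The sketch below is one possible reading of them in Python; the helper names and the toy graph are assumptions made for this example.

```python
def internal_external_degrees(adj, cluster):
    """Map each vertex of `cluster` to (internal degree, external degree)."""
    cluster = set(cluster)
    return {v: (len(adj[v] & cluster), len(adj[v] - cluster)) for v in cluster}


def is_strong_community(adj, cluster):
    """Strong sense (LS-set): every vertex has more internal than external edges."""
    degrees = internal_external_degrees(adj, cluster)
    return all(k_int > k_ext for k_int, k_ext in degrees.values())


def is_weak_community(adj, cluster):
    """Weak sense: the summed internal degree exceeds the summed external degree."""
    degrees = internal_external_degrees(adj, cluster)
    total_int = sum(k_int for k_int, _ in degrees.values())
    total_ext = sum(k_ext for _, k_ext in degrees.values())
    return total_int > total_ext


# Hypothetical toy graph: triangles (0, 1, 2) and (2, 3, 4) sharing vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
strong = is_strong_community(adj, {0, 1, 2})
weak = is_weak_community(adj, {0, 1, 2})
```

Here vertex 2 has two internal and two external edges, so the triangle {0, 1, 2} fails the strong condition but satisfies the weak one, which illustrates that the converse implication does not hold.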

Communities can also be identified by a fitness measure,
expressing to which extent a subgraph satisfies a
given property related to its cohesion. The larger the
fitness, the more definite is the community. This is the
same principle behind quality functions, which give an
estimate of the goodness of a graph partition (see
Section III.C.2). The simplest fitness measure for a cluster
is its intra-cluster density $\delta_{int}(\mathcal{C})$. One could assume
that a subgraph $\mathcal{C}$ with k vertices is a cluster if
$\delta_{int}(\mathcal{C})$ is larger than a threshold, say $\xi$. Finding such
subgraphs is an NP-complete problem, as it coincides
with the NP-complete Clique Problem when the threshold
$\xi = 1$ (Garey and Johnson, 1990). It is better to fix
the size of the subgraph because, without this condition,
any clique would be one of the best possible communities,
including trivial two-cliques (simple edges). Variants of
this problem focus on the number of internal edges of
the subgraph (Asahiro et al., 2002; Feige et al., 2001;
Holzapfel et al., 2003). Another measure of interest is the
relative density $\rho(\mathcal{C})$ of a subgraph $\mathcal{C}$, defined as the ratio
between the internal and the total degree of $\mathcal{C}$. Finding
subgraphs of a given size with $\rho(\mathcal{C})$ larger than a threshold
is NP-complete (Sima and Schaeffer, 2006). Fitness
measures can also be associated to the connectivity of
the subgraph at study to the other vertices of the graph.
A good community is expected to have a small cut size
(see Section A.1), i. e. a small number of edges joining
it to the rest of the graph. This sets a bridge between
community detection and graph partitioning, which we
shall discuss in Section IV. A.



3. Global definitions 

Communities can also be defined with respect to the 
graph as a whole. This is reasonable in those cases in 
which clusters are essential parts of the graph, which can- 
not be taken apart without seriously affecting the func- 
tioning of the system. The literature offers many global 
criteria to identify communities. In most cases they are 
indirect definitions, in which some global property of the 
graph is used in an algorithm that delivers communities 
at the end. However, there is a class of proper definitions,
based on the idea that a graph has community structure
if it is different from a random graph. A random graph
à la Erdős-Rényi (Section A. 3), for instance, is not
expected to have community structure, as any two vertices
have the same probability to be adjacent, so there should
be no preferential linking involving special groups of ver- 
tices. Therefore, one can define a null model, i. e. a graph 
which matches the original in some of its structural fea- 
tures, but which is otherwise a random graph. The null 
model is used as a term of comparison, to verify whether 
the graph at study displays community structure or not. 
The most popular null model is that proposed by New- 
man and Girvan and consists of a randomized version of 
the original graph, where edges are rewired at random, 
under the constraint that the expected degree of each 
vertex matches the degree of the vertex in the original 
graph (Newman and Girvan, 2004). This null model is 
the basic concept behind the definition of modularity, a 
function which evaluates the goodness of partitions of 
a graph into clusters. Modularity will be discussed at 
length in this review, because it has the unique privi- 
lege of being at the same time a global criterion to de- 
fine a community, a quality function and the key ingredi- 
ent of the most popular method of graph clustering. In 
the standard formulation of modularity, a subgraph is a 
community if the number of edges inside the subgraph 
exceeds the expected number of internal edges that the 
same subgraph would have in the null model. This
expected number is an average over all possible realizations
of the null model. Several modifications of modularity
have been proposed (Section VI.B). A general class of
null models, including modularity as a special case, has
been designed by Reichardt and Bornholdt (Reichardt
and Bornholdt, 2006a) (Section VI.B).



4. Definitions based on vertex similarity 

It is natural to assume that communities are groups of 
vertices similar to each other. One can compute the sim- 
ilarity between each pair of vertices with respect to some 
reference property, local or global, no matter whether 
they are connected by an edge or not. Each vertex ends 
up in the cluster whose vertices are most similar to it. 
Similarity measures are at the basis of traditional methods,
like hierarchical, partitional and spectral clustering,
to be discussed in Sections IV. B, IV. C and IV. D. Here
we discuss some popular measures used in the literature.

If it is possible to embed the graph vertices in an
n-dimensional Euclidean space, by assigning a position to
them, one could use the distance between a pair of vertices
as a measure of their similarity (it is actually a measure
of dissimilarity because similar vertices are expected
to be close to each other). Given the two data points
$A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$, one could use
any norm $L_m$, like the Euclidean distance ($L_2$-norm),

$$d^{E}_{AB} = \sqrt{\sum_{k=1}^{n} (a_k - b_k)^2}, \tag{3}$$

the Manhattan distance ($L_1$-norm)

$$d^{M}_{AB} = \sum_{k=1}^{n} |a_k - b_k|, \tag{4}$$

and the $L_\infty$-norm

$$d^{\infty}_{AB} = \max_{k \in [1,n]} |a_k - b_k|. \tag{5}$$

Another popular spatial measure is the cosine similarity,
defined as

$$\rho_{AB} = \arccos \frac{\mathbf{a} \cdot \mathbf{b}}{\sqrt{\sum_{k=1}^{n} a_k^2} \sqrt{\sum_{k=1}^{n} b_k^2}}, \tag{6}$$

where $\mathbf{a} \cdot \mathbf{b}$ is the dot product of the vectors $\mathbf{a}$ and $\mathbf{b}$. The
variable $\rho_{AB}$ is defined in the range $[0, \pi]$.
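
The spatial measures above are straightforward to evaluate once coordinates are available. The following sketch uses only the Python standard library; the function names are chosen here for illustration, and the cosine measure is returned as an angle, following the arccos form used in the text.

```python
import math


def euclidean(a, b):
    """L2 distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l_infinity(a, b):
    """L-infinity distance: the largest coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_angle(a, b):
    """Angle between the position vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against rounding errors before arccos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))


# Hypothetical positions of two vertices embedded in the plane.
A, B = (0.0, 0.0), (3.0, 4.0)
d_e, d_m, d_inf = euclidean(A, B), manhattan(A, B), l_infinity(A, B)
angle = cosine_angle((1.0, 0.0), (0.0, 1.0))
```

For these points the three distances are 5, 7 and 4, respectively, and two orthogonal vectors are separated by an angle of π/2.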

If the graph cannot be embedded in space, the similarity
must necessarily be inferred from the adjacency
relationships between vertices. A possibility is to define
a distance (Burt, 1976; Wasserman and Faust, 1994) between
vertices like

$$d_{ij} = \sqrt{\sum_{k \neq i,j} (A_{ik} - A_{jk})^2}, \tag{7}$$

where A is the adjacency matrix. This is a dissimilarity
measure, based on the concept of structural equivalence
(Lorrain and White, 1971). Two vertices are
structurally equivalent if they have the same neighbors,
even if they are not adjacent themselves. If i and j are
structurally equivalent, $d_{ij} = 0$. Vertices with large degree
and different neighbors are considered very "far"
from each other. Alternatively, one could measure the
overlap between the neighborhoods $\Gamma(i)$ and $\Gamma(j)$ of vertices
i and j, given by the ratio between the intersection
and the union of the neighborhoods, i. e.

$$\omega_{ij} = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|}. \tag{8}$$

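Another measure related to structural equivalence is the
Pearson correlation between columns or rows of the
adjacency matrix,

$$C_{ij} = \frac{\sum_k (A_{ik} - \mu_i)(A_{jk} - \mu_j)}{n \sigma_i \sigma_j}, \tag{9}$$

where the averages $\mu_i = (\sum_j A_{ij})/n$ and the variances
$\sigma_i^2 = \sum_j (A_{ij} - \mu_i)^2 / n$.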
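
The three adjacency-based measures can be compared on a small example. The sketch below assumes the graph is given as a dense 0/1 adjacency matrix (a list of lists); the function names and the toy graph are illustrative choices, not part of the cited works.

```python
import math


def structural_distance(A, i, j):
    """Distance based on structural equivalence; zero if i and j share all neighbors."""
    return math.sqrt(sum((A[i][k] - A[j][k]) ** 2
                         for k in range(len(A)) if k not in (i, j)))


def neighborhood_overlap(A, i, j):
    """Ratio between intersection and union of the neighborhoods of i and j."""
    gamma_i = {k for k, a in enumerate(A[i]) if a}
    gamma_j = {k for k, a in enumerate(A[j]) if a}
    union = gamma_i | gamma_j
    return len(gamma_i & gamma_j) / len(union) if union else 0.0


def pearson(A, i, j):
    """Pearson correlation between rows i and j (rows must not be constant)."""
    n = len(A)
    mu_i, mu_j = sum(A[i]) / n, sum(A[j]) / n
    sigma_i = math.sqrt(sum((a - mu_i) ** 2 for a in A[i]) / n)
    sigma_j = math.sqrt(sum((a - mu_j) ** 2 for a in A[j]) / n)
    cov = sum((A[i][k] - mu_i) * (A[j][k] - mu_j) for k in range(n)) / n
    return cov / (sigma_i * sigma_j)


# Hypothetical toy graph: vertices 0 and 1 are both linked to 2 and 3 only.
A = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
d01 = structural_distance(A, 0, 1)
w01 = neighborhood_overlap(A, 0, 1)
c01 = pearson(A, 0, 1)
```

Vertices 0 and 1 are structurally equivalent although they are not adjacent: their distance is 0, their neighborhood overlap is 1 and their rows are perfectly correlated. Vertices 0 and 2, which share no neighbor, sit at the opposite extreme.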

An alternative measure is the number of edge- (or
vertex-) independent paths between two vertices.
Independent paths do not share any edge (vertex), and their
number is related to the maximum flow that can be
conveyed between the two vertices under the constraint that
each edge can carry only one unit of flow
(max-flow/min-cut theorem (Elias et al., 1956)). The maximum flow
can be calculated in a time $O(m)$, for a graph with m
edges, using techniques like the augmenting path
algorithm (Ahuja et al., 1993). Similarly, one could consider
all paths running between two vertices. In this case, there
is the problem that the total number of paths is infinite,
but this can be avoided if one performs a weighted sum
of the number of paths. For instance, paths of length $l$
can be weighted by the factor $\alpha^l$, with $\alpha < 1$. Another
possibility, suggested by Estrada and Hatano (Estrada
and Hatano, 2008, 2009), is to weigh paths of length $l$
with the inverse factorial $1/l!$. In both cases, the
contribution of long paths is strongly suppressed and the sum
converges.
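
The weighted sums over paths can be approximated by truncating the series, as in the sketch below. It relies on the standard fact that the entry (i, j) of the l-th power of the adjacency matrix counts the walks of length l between i and j, so "paths" are counted as walks here. The function names, the cutoff `max_len` and the toy graph are assumptions of this example; note also that the geometric weighting converges only if α is smaller than the inverse of the largest eigenvalue of the adjacency matrix.

```python
import math


def mat_mul(X, Y):
    """Product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def weighted_path_sum(A, weight, max_len=30):
    """Sum of weight(l) * A^l for l = 1..max_len, entry by entry."""
    n = len(A)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in A]  # holds A^l, starting from l = 1
    for l in range(1, max_len + 1):
        w = weight(l)
        for i in range(n):
            for j in range(n):
                total[i][j] += w * power[i][j]
        power = mat_mul(power, A)
    return total


# Hypothetical toy graph: two vertices joined by a single edge.
A = [[0, 1], [1, 0]]
alpha = 0.5
geometric = weighted_path_sum(A, lambda l: alpha ** l)
factorial = weighted_path_sum(A, lambda l: 1.0 / math.factorial(l))
```

On this graph only walks of odd length connect the two vertices, so the geometric sum tends to α/(1 − α²) and the factorial sum to sinh(1), which gives a quick sanity check of the truncation.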

Another important class of measures of vertex similar- 
ity is based on properties of random walks on graphs. 
One of these properties is the commute-time between a
pair of vertices, which is the average number of steps
needed for a random walker, starting at either vertex,
to reach the other vertex for the first time and to come
back to the starting vertex. Saerens and coworkers (Fouss
and Renders, 2007; Saerens et al., 2004; Yen et al., 2007,
2009) have extensively studied and used the commute-time
(and variants thereof) as (dis)similarity measure:
the larger the time, the farther (less similar) the vertices.
The commute-time is closely related (Chandra et al.,
1989) to the resistance distance introduced by Klein and
Randic (Klein and Randic, 1993), expressing the effective
electrical resistance between two vertices if the graph is
turned into a resistor network. White and Smyth (White
and Smyth, 2003) and Zhou (Zhou, 2003a) used instead
the average first passage time, i. e. the average number
of steps needed to reach for the first time the target
vertex from the source. Harel and Koren (Harel and Koren,
2001) proposed to build measures out of quantities like
the probability that a walker visits a target vertex in no more
than a given number of steps after leaving a source vertex*
and the probability that a random walker starting at a
source visits the target exactly once before hitting the
source again. Another quantity used to define similarity
measures is the escape probability, defined as the
probability that the walker reaches the target vertex before
coming back to the source vertex (Palmer and Faloutsos,
2003; Tong et al., 2008). The escape probability is
related to the effective conductance between the two
vertices in the equivalent resistor network. Other authors
have exploited properties of modified random walks. For
instance, the algorithm by Gori and Pucci (Gori and
Pucci, 2007) and that by Tong et al. (Tong et al., 2008)
used similarity measures derived from Google's PageRank
process (Brin and Page, 1998).
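
As a rough illustration of the commute-time, the sketch below obtains average first passage times by iterating the standard recursion h(v) = 1 + (1/k_v) Σ_u h(u) over the neighbors u of v, with h fixed to zero at the target. This simple fixed-point iteration is an assumption of the example (the cited works use other techniques), it requires a connected graph, and the function names are hypothetical.

```python
def hitting_time(adj, source, target, tol=1e-12, max_iter=100_000):
    """Average first passage time of a random walk from source to target.

    adj : dict mapping each vertex to a list of neighbors (connected graph)
    """
    h = {v: 0.0 for v in adj}
    for _ in range(max_iter):
        # One sweep of h(v) = 1 + average of h over the neighbors of v.
        new_h = {
            v: (0.0 if v == target
                else 1.0 + sum(h[u] for u in adj[v]) / len(adj[v]))
            for v in adj
        }
        converged = max(abs(new_h[v] - h[v]) for v in adj) < tol
        h = new_h
        if converged:
            break
    return h[source]


def commute_time(adj, i, j):
    """Average number of steps to go from i to j and back to i."""
    return hitting_time(adj, i, j) + hitting_time(adj, j, i)


# Hypothetical toy graph: a path 0 - 1 - 2 with m = 2 edges.
path = {0: [1], 1: [0, 2], 2: [1]}
c02 = commute_time(path, 0, 2)
c01 = commute_time(path, 0, 1)
```

On the path the commute-times are 8 and 4, i.e. 2m times the effective resistances 2 and 1 of the corresponding unit-resistor network, in line with the relation to the resistance distance recalled above.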



* In the clustering method by Latapy and Pons (Latapy and Pons,
2005) (discussed in Section VIII. B) and in a recent analysis by
Nadler et al. (Nadler et al., 2006), a dissimilarity measure called
"diffusion distance" is defined, which is derived from the
probability that the walker visits the target after a fixed number
of steps.



C. Partitions 

1. Basics 

A partition is a division of a graph in clusters, such that 
each vertex belongs to one cluster. As we have seen in 
Section II, in real systems vertices may be shared among 
different communities. A division of a graph into
overlapping (or fuzzy) communities is called a cover.

The number of possible partitions in k clusters of a
graph with n vertices is the Stirling number of the
second kind $S(n,k)$ (Andrews, 1976). The total number
of possible partitions is the n-th Bell number
$B_n = \sum_{k=0}^{n} S(n,k)$ (Andrews, 1976). In the limit of large n,
$B_n$ has the asymptotic form (Lovasz, 1993)

$$B_n \sim \frac{1}{\sqrt{n}} [\lambda(n)]^{n+1/2} e^{\lambda(n)-n-1}, \tag{10}$$

where $\lambda(n) = e^{W(n)} = n/W(n)$, $W(n)$ being the Lambert
W function (Polya and Szego, 1998). Therefore, $B_n$
grows faster than exponentially with the graph size n,
which means that an enumeration and/or evaluation of
all partitions of a graph is impossible, unless the graph
consists of very few vertices.
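
To get a feeling for this growth, the number of partitions can be computed exactly for small n. The sketch below uses the standard recurrence S(n, k) = k S(n−1, k) + S(n−1, k−1) for the Stirling numbers of the second kind; the function names are chosen here for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of n labelled vertices into exactly k clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # The n-th vertex either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Total number of partitions of a set of n vertices."""
    return sum(stirling2(n, k) for k in range(n + 1))


partition_counts = [bell(n) for n in range(1, 11)]
```

Already for ten vertices there are 115975 possible partitions, which makes clear why exhaustive enumeration is only feasible for very small graphs.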

Partitions can be hierarchically ordered, when the 
graph has different levels of organization/structure at 
different scales. In this case, clusters display in turn 
community structure, with smaller communities inside, 
which may again contain smaller communities, and so on 
(Fig. 7). As an example, in a social network of children 
living in the same town, one could group the children 
according to the schools they attend, but within each 
school one can make a subdivision into classes. Hier- 
archical organization is a common feature of many real 
networks, where it is revealed by a peculiar scaling of the 
clustering coefficient for vertices having the same degree 
k, when plotted as a function of k (Ravasz and Barabasi, 
2003; Ravasz et al., 2002). A natural way to represent 
the hierarchical structure of a graph is to draw a dendro- 
gram, like the one illustrated in Fig. 8. Here, partitions 
of a graph with twelve vertices are shown. At the bot- 
tom, each vertex is its own module (the "leaves" of the 
tree). By moving upwards, groups of vertices are suc- 
cessively aggregated. Mergers of communities are repre- 
sented by horizontal lines. The uppermost level repre- 
sents the whole graph as a single community. Cutting 
the diagram horizontally at some height, as shown in the 
figure (dashed line), displays one partition of the graph. 
The diagram is hierarchical by construction: each com- 
munity belonging to a level is fully included in a commu- 
nity at a higher level. Dendrograms are regularly used in 
sociology and biology. The technique of hierarchical clus- 
tering, described in Section IV. B, lends itself naturally to 
this kind of representation. 

FIG. 7 Schematic example of a hierarchical graph. Sixteen modules with 32 vertices each clearly form four larger clusters. All
vertices have degree 64. Reprinted figure with permission from Ref. (Lancichinetti et al., 2009). ©2009 by IOP Publishing.

FIG. 8 A dendrogram, or hierarchical tree. Horizontal
cuts correspond to partitions of the graph in communities.
Reprinted figure with permission from Ref. (Newman and
Girvan, 2004). ©2004 by the American Physical Society.



2. Quality functions: modularity 

Reliable algorithms are supposed to identify good par- 
titions. But what is a good clustering? In order to dis- 
tinguish between "good" and "bad" clusterings, it would 
be useful to require that partitions satisfy a set of basic 
properties, intuitive and easy to agree upon. In the wider 
context of data clustering, this issue has been studied by
Jon Kleinberg (Kleinberg, 2002), who has proved an
important impossibility theorem. Given a set S of points, a
distance function d is defined, which is positive definite
and symmetric (the triangular inequality is not explicitly
required). One wishes to find a clustering f based on the
distances between the points. Kleinberg showed that no
clustering satisfies at the same time the three following
properties:

1. Scale-invariance: given a positive constant $\alpha$, multiplying
any distance function d by $\alpha$ yields the same
clustering.

2. Richness: any possible partition of the given point 
set can be recovered if one chooses a suitable dis- 
tance function d. 

3. Consistency: given a partition, any modification of 
the distance function that does not decrease the dis- 
tance between points of different clusters and that 
does not increase the distance between points of the 
same cluster, yields the same clustering. 

The theorem cannot be extended to graph clustering
because the distance function cannot be in general defined
for a graph which is not complete. For weighted
complete graphs, like correlation matrices (Tumminello et al.,
2008), it is often possible to define a distance function.
On a generic graph, except for the first property, which
does not make sense without a distance function^, the
other two are quite well defined. The property of richness
implies that, given a partition, one can set edges between
the vertices in such a way that the partition is a natural
outcome of the resulting graph (e.g., it could be achieved
by setting edges only between vertices of the same
cluster). Consistency here implies that deleting inter-cluster
edges and adding intra-cluster edges yields the same
partition.

Many algorithms are able to identify a subset of mean- 
ingful partitions, ideally one or just a few, whereas some 
others, like techniques based on hierarchical clustering 
(Section IV. B), deliver a large number of partitions. That 
does not mean that the partitions found are equally good. 
Therefore it is helpful (sometimes even necessary) to have 
a quantitative criterion to assess the goodness of a graph 
partition. A quality function is a function that assigns a 
number to each partition of a graph. In this way one can 
rank partitions based on their score given by the quality 
function. Partitions with high scores are "good", so the
one with the largest score is by definition the best.
Nevertheless, one should keep in mind that the question of
when a partition is better than another one is ill-posed, 
and the answer depends on the specific concept of com- 
munity and/or quality function adopted. 

A quality function Q is additive if there is an elementary
function q such that, for any partition $\mathcal{P}$ of a graph,

$$ Q(\mathcal{P}) = \sum_{C \in \mathcal{P}} q(C), \tag{11} $$

where C is a generic cluster of partition $\mathcal{P}$. Eq. 11 states
that the quality of a partition is given by the sum of the
qualities of the individual clusters. The function q(C)
could be any of the cluster fitness functions discussed 
in Section III.B.2, for instance. Most quality functions 
used in the literature are additive, although it is not a 
necessary requirement. 

An example of quality function is the performance P, 
which counts the number of correctly "interpreted" pairs 
of vertices, i. e. two vertices belonging to the same com- 
munity and connected by an edge, or two vertices be- 
longing to different communities and not connected by 
an edge. The definition of performance, for a partition 
$\mathcal{P}$, is

$$ P(\mathcal{P}) = \frac{|\{(i,j) \in E,\ C_i = C_j\}| + |\{(i,j) \notin E,\ C_i \neq C_j\}|}{n(n-1)/2}. \tag{12} $$

By definition, $0 \le P(\mathcal{P}) \le 1$. Another example is
coverage, i. e. the ratio of the number of intra-community
edges to the total number of edges: by definition, an
ideal cluster structure, where the clusters are disconnected
from each other, yields a coverage of 1, as all
edges of the graph fall within clusters.
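Both measures are easy to evaluate by direct counting. The following is a minimal sketch, assuming a simple undirected graph given as an edge list and a partition given as a list of cluster labels; the function names and the toy graph (two triangles joined by one edge) are illustrative choices, not part of the original definitions.

```python
from itertools import combinations

def performance(n, edges, membership):
    """Eq. 12: fraction of vertex pairs that are 'correctly interpreted'."""
    edge_set = {frozenset(e) for e in edges}
    correct = 0
    for i, j in combinations(range(n), 2):
        linked = frozenset((i, j)) in edge_set
        same_cluster = membership[i] == membership[j]
        # Correct pairs: intra-cluster edges and inter-cluster non-edges.
        if linked == same_cluster:
            correct += 1
    return correct / (n * (n - 1) / 2)

def coverage(edges, membership):
    """Ratio of intra-community edges to the total number of edges."""
    intra = sum(1 for i, j in edges if membership[i] == membership[j])
    return intra / len(edges)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
membership = [0, 0, 0, 1, 1, 1]
print(performance(6, edges, membership))  # 14 of the 15 pairs are correct
print(coverage(edges, membership))        # 6 of the 7 edges are internal
```

On this example only the pair joined by the bridging edge is "misinterpreted", which is why both scores are close to, but smaller than, 1.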



^ The traditional shortest-path distance between vertices is not 
suitable here, as it is integer by definition. 



The most popular quality function is the modularity 
of Newman and Girvan (Newman and Girvan, 2004). It 
is based on the idea that a random graph is not expected 
to have a cluster structure, so the possible existence of 
clusters is revealed by the comparison between the actual 
density of edges in a subgraph and the density one would 
expect to have in the subgraph if the vertices of the graph 
were attached regardless of community structure. This 
expected edge density depends on the chosen null model, 
i. e. a copy of the original graph keeping some of its 
structural properties but without community structure. 
Modularity can then be written as follows 

$$ Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - P_{ij} \right) \delta(C_i, C_j), \tag{13} $$

where the sum runs over all pairs of vertices, A is the
adjacency matrix, m the total number of edges of the
graph, and $P_{ij}$ represents the expected number of edges
between vertices i and j in the null model. The δ-function
yields one if vertices i and j are in the same community
($C_i = C_j$), zero otherwise. The choice of the null model
graph is in principle arbitrary, and several possibilities 
exist. For instance, one could simply demand that the 
graph keeps the same number of edges as the original 
graph, and that edges are placed with the same proba- 
bility between any pair of vertices. In this case (Bernoulli 
random graph), the null model term in Eq. 13 would be a 
constant (i. e. $P_{ij} = p = 2m/[n(n-1)]$, $\forall i, j$). However
this null model is not a good descriptor of real networks, 
as it has a Poissonian degree distribution which is very 
different from the skewed distributions found in real networks.
Due to the important implications that broad degree
distributions have for the structure and function of
real networks (Albert and Barabasi, 2002; Barrat et al.,
2008; Boccaletti et al., 2006; Dorogovtsev and Mendes,
2002; Newman, 2003; Pastor-Satorras and Vespignani,
2004), it is preferable to go for a null model with the
same degree distribution of the original graph. The stan- 
dard null model of modularity imposes that the expected 
degree sequence (after averaging over all possible configu- 
rations of the model) matches the actual degree sequence 
of the graph. This is a stricter constraint than merely 
requiring the match of the degree distributions, and is 
essentially equivalent^ to the configuration model, which
has been the subject of intense investigations in the recent
literature on networks (Luczak, 1992; Molloy and Reed,
1995). In this null model, a vertex could be attached to
any other vertex of the graph and the probability that
vertices i and j, with degrees $k_i$ and $k_j$, are connected



The difference is that the configuration model maintains the 
same degree sequence of the original graph for each realization, 
whereas in the null model of modularity the degree sequence of a 
realization is in general different, and only the average/expected 
degree sequence coincides with that of the graph at hand. The 
two models are equivalent in the limit of infinite graph size.

can be calculated without problems. In fact, in order to
form an edge between i and j one needs to join two stubs
(i. e. half-edges), incident with i and j. The probability
$p_i$ to pick at random a stub incident with i is $k_i/2m$, as
there are $k_i$ stubs incident with i out of a total of 2m.
The probability of a connection between i and j is then
given by the product $p_i p_j$, since edges are placed
independently of each other. The result is $k_i k_j/4m^2$, which
yields an expected number $P_{ij} = 2m p_i p_j = k_i k_j/2m$ of
edges between i and j. So, the final expression of modularity
reads



$$ Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j). \tag{14} $$



Since the only contributions to the sum come from vertex 
pairs belonging to the same cluster, we can group these 
contributions together and rewrite the sum over the vertex
pairs as a sum over the clusters

$$ Q = \sum_{c=1}^{n_c} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]. \tag{15} $$



Here, $n_c$ is the number of clusters, $l_c$ the total number of
edges joining vertices of module c and $d_c$ the sum of the
degrees of the vertices of c. In Eq. 15, the first term of
each summand is the fraction of edges of the graph inside 
the module, whereas the second term represents the ex- 
pected fraction of edges that would be there if the graph 
were a random graph with the same expected degree for 
each vertex. 
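The equivalence of the sum over vertex pairs (Eq. 14) and the sum over clusters (Eq. 15) can be checked numerically. The sketch below assumes a simple undirected graph without self-loops, given as an edge list, and a partition given as a list of labels; the function names and the toy graph are illustrative assumptions.

```python
def modularity_pairs(n, edges, membership):
    """Eq. 14: sum over all ordered vertex pairs (i, j), including i = j."""
    m = len(edges)
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] += 1
        A[j][i] += 1
    k = [sum(row) for row in A]  # vertex degrees
    total = 0.0
    for i in range(n):
        for j in range(n):
            if membership[i] == membership[j]:
                total += A[i][j] - k[i] * k[j] / (2 * m)
    return total / (2 * m)

def modularity_clusters(edges, membership):
    """Eq. 15: sum over clusters of l_c/m - (d_c/2m)^2."""
    m = len(edges)
    l, d = {}, {}
    for i, j in edges:
        d[membership[i]] = d.get(membership[i], 0) + 1
        d[membership[j]] = d.get(membership[j], 0) + 1
        if membership[i] == membership[j]:
            l[membership[i]] = l.get(membership[i], 0) + 1
    # Clusters made only of isolated vertices have d_c = 0 and contribute zero.
    return sum(l.get(c, 0) / m - (d[c] / (2 * m)) ** 2 for c in d)

# Hypothetical toy graph: two triangles joined by a single edge (m = 7).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_triangles = [0, 0, 0, 1, 1, 1]
whole_graph = [0] * 6
singletons = list(range(6))
print(modularity_pairs(6, edges, two_triangles))
print(modularity_clusters(edges, two_triangles))
```

For the partition in two triangles each cluster has $l_c = 3$ and $d_c = 7$, so both functions give $2\,(3/7 - 1/4) = 5/14$. The same code returns zero when the whole graph is taken as a single cluster and a negative value when every vertex is its own cluster.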

A nice feature of modularity is that it can be equiva- 
lently expressed both in terms of the intra-cluster edges, 
as in Eq. 15, and in terms of the inter-cluster edges (Djidjev,
2007). In fact, the maximum of modularity can be
expressed as

$$ Q_{max} = \max_{\mathcal{P}} \left\{ \sum_{c=1}^{n_c} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right] \right\} = \frac{1}{m} \max_{\mathcal{P}} \left\{ \sum_{c=1}^{n_c} \left[ l_c - \mathrm{Ex}(l_c) \right] \right\} = -\frac{1}{m} \min_{\mathcal{P}} \left\{ -\sum_{c=1}^{n_c} \left[ l_c - \mathrm{Ex}(l_c) \right] \right\}, \tag{16} $$

where $\max_{\mathcal{P}}$ and $\min_{\mathcal{P}}$ indicate the maximum and
the minimum over all possible graph partitions $\mathcal{P}$ and
$\mathrm{Ex}(l_c) = d_c^2/4m$ indicates the expected number of links
in cluster c in the null model of modularity. By adding
and subtracting the total number of edges m of the graph
one finally gets

$$ Q_{max} = -\frac{1}{m} \min_{\mathcal{P}} \left[ \left( m - \sum_{c=1}^{n_c} l_c \right) - \left( m - \sum_{c=1}^{n_c} \mathrm{Ex}(l_c) \right) \right] = -\frac{1}{m} \min_{\mathcal{P}} \left( |\mathrm{Cut}_{\mathcal{P}}| - \mathrm{ExCut}_{\mathcal{P}} \right). \tag{17} $$

In the last expression $|\mathrm{Cut}_{\mathcal{P}}| = m - \sum_{c=1}^{n_c} l_c$ is the number
of inter-cluster edges of partition $\mathcal{P}$, and
$\mathrm{ExCut}_{\mathcal{P}} = m - \sum_{c=1}^{n_c} \mathrm{Ex}(l_c)$ is the expected number of
inter-cluster edges of the partition in modularity's null model.

According to Eq. 15, a subgraph is a module if the
corresponding contribution to modularity in the sum is
positive. The more the number of internal edges of the
cluster exceeds the expected number, the better defined
the community. So, large positive values of the modularity
indicate good partitions^. The maximum modularity
of a graph generally grows if the size of the graph and/or
the number of (well-separated) clusters increase (Good
et al., 2009). Therefore, modularity should not be used
to compare the quality of the community structure of
graphs which are very different in size. The modularity
of the whole graph, taken as a single community, is zero,
as the two terms of the only summand in this case are
equal and opposite. Modularity is always smaller than
one, and can be negative as well. For instance, the modularity
of the partition in which each vertex is a community is
always negative: in this case the sum runs over n terms,
which are all negative as the first term of each summand
is zero. This is a nice feature of the measure, implying
that, if there are no partitions with positive modularity,
the graph has no community structure. On the contrary,
the existence of partitions with large negative modularity
values may hint at the existence of subgroups with very
few internal edges and many edges lying between them
(multipartite structure) (Newman, 2006a). Holmstrom et al.
(Holmstrom et al., 2009) have shown that the distribution of
modularity values across the partitions of various graphs,
real and artificial (including random graphs with no apparent
community structure), has some stable features,
and that the most likely modularity values correspond to
partitions in clusters of approximately equal size.

^ This is not necessarily true, as we will see in Section VI. C.

Modularity has been employed as quality function in
many algorithms, like some of the divisive algorithms
of Section V. In addition, modularity optimization is itself
a popular method for community detection (see Section
VI. A). Modularity also allows one to assess the stability
of partitions (Massen and Doye, 2006) (Section XIV),
it can be used to design layouts for graph visualization
(Noack, 2009) and to perform a sort of renormalization
of a graph, by transforming a graph into a smaller
one with the same community structure (Arenas et al.,
2007).



IV. TRADITIONAL METHODS
A. Graph partitioning

The problem of graph partitioning consists in dividing
the vertices in g groups of predefined size, such that the






FIG. 9 Graph partitioning. The dashed line shows the solution
of the minimum bisection problem for the graph illustrated,
i. e. the partition in two groups of equal size with minimal
number of edges running between the groups. Reprinted
figure with permission from Ref. (Fortunato and Castellano,
2009). ©2009 by Springer.



number of edges lying between the groups is minimal. 
The number of edges running between clusters is called 
cut size. Fig. 9 presents the solution of the problem for 
a graph with fourteen vertices, for g = 2 and clusters of
equal size. 

Specifying the number of clusters of the partition is 
necessary. If one simply imposed a partition with the 
minimal cut size, and left the number of clusters free, 
the solution would be trivial, corresponding to all ver- 
tices ending up in the same cluster, as this would yield 
a vanishing cut size. Specifying the size is also neces- 
sary, as otherwise the most likely solution of the problem 
would consist in separating the lowest degree vertex from 
the rest of the graph, which is quite uninteresting. This 
problem can be actually avoided by choosing a different 
measure to optimize for the partitioning, which accounts 
for the size of the clusters. Some of these measures will 
be briefly introduced at the end of this section. 

Graph partitioning is a fundamental issue in parallel 
computing, circuit partitioning and layout, and in the 
design of many serial algorithms, including techniques 
to solve partial differential equations and sparse linear 
systems of equations. Most variants of the graph parti- 
tioning problem are NP-hard. There are however several 
algorithms that can do a good job, even if their solutions 
are not necessarily optimal (Pothen, 1997). Many algo- 
rithms perform a bisection of the graph. Partitions into 
more than two clusters are usually attained by iterative 
bisectioning. Moreover, in most cases one imposes the 
constraint that the clusters have equal size. This prob- 
lem is called minimum bisection and is NP-hard. 

The Kernighan-Lin algorithm (Kernighan and Lin, 
1970) is one of the earliest methods proposed and is still 



frequently used, often in combination with other tech- 
niques. The authors were motivated by the problem of 
partitioning electronic circuits onto boards: the nodes 
contained in different boards need to be linked to each 
other with the least number of connections. The pro- 
cedure is an optimization of a benefit function Q, which 
represents the difference between the number of edges in- 
side the modules and the number of edges lying between 
them. The starting point is an initial partition of the 
graph in two clusters of the predefined size: such initial 
partition can be random or suggested by some informa- 
tion on the graph structure. Then, subsets consisting of 
equal numbers of vertices are swapped between the two 
groups, so that Q has the maximal increase. The sub- 
sets can consist of single vertices. To reduce the risk to 
be trapped in local maxima of Q, the procedure includes 
some swaps that decrease the function Q. After a series 
of swaps with positive and negative gains, the partition 
with the largest value of Q is selected and used as starting
point of a new series of iterations. The Kernighan-Lin
algorithm is quite fast, scaling as $O(n^2 \log n)$ (n being
as usual the number of vertices), if only a constant
number of swaps are performed at each iteration. The
most expensive part is the identification of the subsets to
swap, which requires the computation of the gains/losses
for any pair of candidate subsets. On sparse graphs, a
slightly different heuristic lowers the complexity
to $O(n^2)$. The partitions found by the procedure are
strongly dependent on the initial configuration and other 
algorithms can do better. It is preferable to start with 
a good guess about the sought partition, otherwise the 
results are quite poor. Therefore the method is typi- 
cally used to improve on the partitions found through 
other techniques, by using them as starting configura- 
tions for the algorithm. The Kernighan-Lin algorithm 
has been extended to extract partitions in any number 
of parts (Suaris and Kedem, 1988), however the run-time 
and storage costs increase rapidly with the number of 
clusters. 
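The swap-and-rollback logic described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes an unweighted graph stored as a dictionary of neighbor sets, swaps single vertices only, and works with the cut size instead of Q (for a bisection the two are equivalent, since internal plus external edges always sum to m). All names are illustrative.

```python
def cut_size(adj, side):
    """Number of edges whose endpoints lie in different groups."""
    return sum(1 for u in adj for w in adj[u] if u < w and side[u] != side[w])

def swap_gain(adj, side, a, b):
    """Decrease of the cut size if a and b (in different groups) are swapped."""
    def d(u):
        external = sum(1 for w in adj[u] if side[w] != side[u])
        return external - (len(adj[u]) - external)
    return d(a) + d(b) - 2 * (1 if b in adj[a] else 0)

def kernighan_lin(adj, side, max_passes=10):
    side = dict(side)
    for _ in range(max_passes):
        work, locked, gains, swaps = dict(side), set(), [], []
        while True:
            # Pick the unlocked pair with the largest gain, even if negative.
            best = None
            for a in (u for u in adj if work[u] == 0 and u not in locked):
                for b in (w for w in adj if work[w] == 1 and w not in locked):
                    g = swap_gain(adj, work, a, b)
                    if best is None or g > best[0]:
                        best = (g, a, b)
            if best is None:
                break
            g, a, b = best
            work[a], work[b] = 1, 0
            locked.update((a, b))
            gains.append(g)
            swaps.append((a, b))
        # Keep only the prefix of swaps with the largest cumulative gain.
        running, best_total, best_len = 0, 0, 0
        for idx, g in enumerate(gains, start=1):
            running += g
            if running > best_total:
                best_total, best_len = running, idx
        if best_len == 0:
            break  # no series of swaps improves the current partition
        for a, b in swaps[:best_len]:
            side[a], side[b] = 1, 0
    return side

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}  # deliberately poor initial bisection
result = kernighan_lin(adj, start)
print(cut_size(adj, start), cut_size(adj, result))
```

Starting from a bisection that cuts five edges, the first pass swaps vertices 2 and 3 and then tries further swaps with negative gain, which are discarded by the rollback; the second pass finds no improvement and the procedure stops.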

Another popular technique is the spectral bisection 
method (Barnes, 1982), which is based on the properties 
of the spectrum of the Laplacian matrix. Spectral clus- 
tering will be discussed more thoroughly in Section IV. D, 
here we focus on its application to graph partitioning. 

Every partition of a graph with n vertices in two groups
can be represented by an index vector s, whose component
$s_i$ is +1 if vertex i is in one group and −1 if it is in
the other group. The cut size R of the partition of the
graph in the two groups can be written as

$$ R = \frac{1}{4} \mathbf{s}^T \mathbf{L} \mathbf{s}, \tag{18} $$

where L is the Laplacian matrix and $\mathbf{s}^T$ the transpose of
vector s. Vector s can be written as $\mathbf{s} = \sum_i a_i \mathbf{v}_i$, where
$\mathbf{v}_i$, i = 1, ..., n are the eigenvectors of the Laplacian. If s
is properly normalized, then

$$ R = \sum_i a_i^2 \lambda_i, \tag{19} $$

where $\lambda_i$ is the Laplacian eigenvalue corresponding to
eigenvector $\mathbf{v}_i$. It is worth remarking that the sum contains
at most n−1 terms, as the Laplacian has at least one
zero eigenvalue. Minimizing R is equivalent to minimizing
the sum on the right-hand side of Eq. 19. This task
is still very hard. However, if the second lowest eigenvalue
$\lambda_2$ is close enough to zero, a good approximation of
the minimum can be attained by choosing s parallel to
the corresponding eigenvector $\mathbf{v}_2$, which is called Fiedler
vector (Fiedler, 1973): this would reduce the sum to $\lambda_2$,
which is a small number. But the index vector cannot
be perfectly parallel to $\mathbf{v}_2$ by construction, because all
its components are equal in modulus, whereas the components
of $\mathbf{v}_2$ are not. The best choice is to match the
signs of the components. So, one can set $s_i = +1$ ($-1$)
if $v_2^i > 0$ ($< 0$). It may happen that the sizes of the two
corresponding groups do not match the predefined sizes
one wishes to have. In this case, if one aims at a split in
$n_1$ and $n_2 = n - n_1$ vertices, the best strategy is to order
the components of the Fiedler vector from the lowest to
the largest values and to put in one group the vertices
corresponding to the first $n_1$ components from the top
or the bottom, and the remaining vertices in the second
group. This procedure yields two partitions: the better
solution is naturally the one that gives the smaller cut
size.
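As an illustration of the sign rule, the sketch below computes an approximate Fiedler vector and evaluates the cut size through Eq. 18. It is not the Lanczos method: as a simplifying assumption, it uses plain power iteration on the shifted matrix cI − L, restricted to the subspace orthogonal to the constant vector, which is adequate only for small connected graphs. The graph, names and parameters are illustrative.

```python
import random

def fiedler_vector(adj, iters=500, seed=0):
    """Approximate eigenvector of the second lowest Laplacian eigenvalue."""
    nodes = sorted(adj)
    n = len(nodes)
    deg = {u: len(adj[u]) for u in nodes}
    c = 2 * max(deg.values()) + 1  # exceeds the largest Laplacian eigenvalue
    rng = random.Random(seed)
    x = {u: rng.random() - 0.5 for u in nodes}
    for _ in range(iters):
        mean = sum(x.values()) / n
        x = {u: x[u] - mean for u in nodes}  # project out the constant eigenvector
        # y = (c I - L) x, with (L x)_u = k_u x_u - sum of x over the neighbors of u
        y = {u: c * x[u] - (deg[u] * x[u] - sum(x[w] for w in adj[u])) for u in nodes}
        norm = sum(val * val for val in y.values()) ** 0.5
        x = {u: y[u] / norm for u in nodes}
    return x

def cut_from_index_vector(adj, s):
    """Eq. 18: R = (1/4) s^T L s."""
    return sum(s[u] * (len(adj[u]) * s[u] - sum(s[w] for w in adj[u])) for u in adj) / 4

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
v2 = fiedler_vector(adj)
s = {u: 1 if v2[u] > 0 else -1 for u in adj}  # match the signs of the components
groups = {frozenset(u for u in adj if s[u] == 1), frozenset(u for u in adj if s[u] == -1)}
print(groups, cut_from_index_vector(adj, s))
```

On this graph the components of the Fiedler vector have one sign on the first triangle and the opposite sign on the second, so the sign rule recovers the bisection that cuts only the bridging edge. Which triangle gets the label +1 depends on the arbitrary overall sign of the eigenvector.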

The spectral bisection method is quite fast. The first 
eigenvectors of the Laplacian can be computed by using 
the Lanczos method (Lanczos, 1950). The time required 
to compute the first k eigenvectors of a matrix with the 
Lanczos method depends on the size of the eigengap
$|\lambda_{k+1} - \lambda_k|$ (Golub and Loan, 1989). If the eigenvalues
$\lambda_{k+1}$ and $\lambda_k$ are well separated, the running time of
the algorithm is much shorter than the time required to
calculate the complete set of eigenvectors, which scales as
$O(n^3)$. The method gives in general good partitions, that
can be further improved by applying the Kernighan-Lin 
algorithm. 

The well known max-flow min-cut theorem by Ford 
and Fulkerson (Ford and Fulkerson, 1956) states that the
minimum cut between any two vertices s and t of a graph,
i. e. any minimal subset of edges whose deletion would 
topologically separate s from t, carries the maximum flow 
that can be transported from s to t across the graph. In 
this context edges play the role of water pipes, with a 
given carrying capacity (e.g. their weights), and vertices 
the role of pipe junctions. This theorem has been used 
to determine minimal cuts from maximal flows in clus- 
tering algorithms. There are several efficient routines to 
compute maximum flows in graphs, like the algorithm 
of Goldberg and Tarjan (Goldberg and Tarjan, 1988). 
Flake et al. (Flake et al., 2000; Flake et al., 2002) have
recently used maximum flows to identify communities in
the graph of the World Wide Web. The web graph is
directed but for the purposes of the calculation Flake et
al. treated the edges as undirected. Web communities
are defined to be "strong" (LS-sets), i. e. the internal degree
of each vertex must not be smaller than its external
degree (Radicchi et al., 2004). An artificial sink t is added
to the graph and one calculates the maximum flows from 
a source vertex s to the sink t: the corresponding mini- 
mum cut identifies the community of vertex s, provided s 
shares a sufficiently large number of edges with the other 
vertices of its community, otherwise one could get trivial 
separations and meaningless clusters. 
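The theorem itself can be illustrated with a short augmenting-path computation. The sketch below uses breadth-first augmenting paths (the Edmonds-Karp variant of the Ford-Fulkerson scheme), which is a swapped-in textbook routine and not the Goldberg-Tarjan algorithm nor the procedure of Flake et al.; it does not add an artificial sink. Unit capacities and the toy graph are assumptions made for the example.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Return (maximum flow, set of vertices on the source side of a minimum cut).

    capacity[u][v] is the capacity of the directed arc u -> v.
    """
    residual = {}
    for u in capacity:
        for v, c in capacity[u].items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        # Bottleneck capacity along the path, then update the residual graph.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
    # Vertices still reachable from s define the source side of a minimum cut.
    return flow, set(parent)

def undirected_unit_capacities(edges):
    cap = {}
    for u, v in edges:
        cap.setdefault(u, {})[v] = 1
        cap.setdefault(v, {})[u] = 1
    return cap

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
flow, source_side = max_flow_min_cut(undirected_unit_capacities(edges), 0, 5)
print(flow, source_side)
```

Here the maximum flow from vertex 0 to vertex 5 equals the capacity of the single bridging edge, and the vertices left reachable from the source form the first triangle.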

Other popular methods for graph partitioning in- 
clude level-structure partitioning, the geometric algo- 
rithm, multilevel algorithms, etc. A good description of 
these algorithms can be found in Ref. (Pothen, 1997). 

Graphs can be also partitioned by minimizing measures
that are related to the cut size, like conductance (Bollobas,
1998). The conductance $\Phi(C)$ of the subgraph C
of a graph G is defined as

$$ \Phi(C) = \frac{c(C, G \setminus C)}{\min(k_C, k_{G \setminus C})}, \tag{20} $$

where $c(C, G \setminus C)$ is the cut size of C, and $k_C$, $k_{G \setminus C}$ are the
total degrees of C and of the rest of the graph $G \setminus C$, respectively.
Cuts are defined only between non-empty sets,
otherwise the measure would not be defined (as the de- 
nominator in Eq. 20 would vanish). The minimum of the 
conductance is obtained in correspondence of low values 
of the cut size and of large values for the denominator in 
Eq. 20, which peaks when the total degrees of the cluster 
and its complement are equal. In practical applications, 
especially on large graphs, close values of the total de- 
grees correspond to clusters of approximately equal size. 
The problem of finding a cut with minimal conductance 
is NP-hard (Šíma and Schaeffer, 2006). Similar measures
are the ratio cut (Wei and Cheng, 1989) and the
normalized cut (Shi and Malik, 1997, 2000). The ratio
cut of a cluster C is defined as

$$ \Phi_C(C) = \frac{c(C, G \setminus C)}{n_C \, n_{G \setminus C}}, \tag{21} $$

where $n_C$ and $n_{G \setminus C}$ are the number of vertices of the two
subgraphs. The normalized cut of a cluster C is



$$ \Phi_N(C) = \frac{c(C, G \setminus C)}{k_C}, \tag{22} $$

where $k_C$ is again the total degree of C. As for the
conductance, minimizing the ratio cut and the normal- 
ized cut favors partitions into clusters of approximately 
equal size, measured in terms of the number of vertices 
or edges, respectively. On the other hand, graph par- 
titioning requires preliminary assumptions on the clus- 
ter sizes, whereas the minimization of conductance, ratio 
cut and normalized cut does not. The ratio cut was in- 
troduced for circuit partitioning (Wei and Cheng, 1989) 
and its optimization is an NP-hard problem (Matula and 
Shahrokhi, 1990). The normalized cut is frequently used 
in image segmentation (Blake and Zisserman, 1987) and 
its optimization is NP-complete (Shi and Malik, 2000). 
The ratio cut and the normalized cut can be quite well
minimized via spectral clustering (Chan et al., 1993; Hagen
and Kahng, 1992) (Section IV.D).
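The three measures of Eqs. 20-22 differ only in the denominator, as the following sketch makes explicit. It assumes an unweighted graph stored as a dictionary of neighbor sets and a cluster that is neither empty nor the whole graph; the function name and the toy graph are illustrative.

```python
def cut_measures(adj, cluster):
    """Conductance, ratio cut and normalized cut of a cluster (Eqs. 20-22)."""
    cluster = set(cluster)
    rest = set(adj) - cluster
    cut = sum(1 for u in cluster for w in adj[u] if w not in cluster)
    k_c = sum(len(adj[u]) for u in cluster)   # total degree of the cluster
    k_rest = sum(len(adj[u]) for u in rest)   # total degree of the rest of the graph
    return {
        "conductance": cut / min(k_c, k_rest),
        "ratio_cut": cut / (len(cluster) * len(rest)),
        "normalized_cut": cut / k_c,
    }

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
balanced = cut_measures(adj, {0, 1, 2})
single_vertex = cut_measures(adj, {0})
print(balanced)
print(single_vertex)
```

Separating a single vertex of degree two gives a conductance of 1, whereas the balanced split along the bridging edge gives 1/7, which shows how these measures penalize the trivial solutions that a bare minimum cut would favor.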

Algorithms for graph partitioning are not good for 
community detection, because it is necessary to provide 
as input the number of groups and in some cases even 
their sizes, about which in principle one knows nothing. 
Instead, one would like an algorithm capable of producing
this information in its output. Besides, from the method- 
ological point of view, using iterative bisectioning to split 
the graph in more pieces is not a reliable procedure. For 
instance, a split into three clusters is necessarily obtained 
by breaking either cluster of the original bipartition in 
two parts, whereas in many cases a minimum cut parti- 
tion is obtained if the third cluster is a merger of parts 
of both initial clusters. 



B. Hierarchical clustering 

In general, very little is known about the community 
structure of a graph. It is uncommon to know the num- 
ber of clusters in which the graph is split, or other in- 
dications about the membership of the vertices. In such 
cases clustering procedures like graph partitioning meth- 
ods can hardly be of help, and one is forced to make some 
reasonable assumptions about the number and size of the 
clusters, which are often unjustified. On the other hand, 
the graph may have a hierarchical structure, i. e. may 
display several levels of grouping of the vertices, with 
small clusters included within large clusters, which are 
in turn included in larger clusters, and so on. Social net- 
works, for instance, often have a hierarchical structure 
(Section III.C.1). In such cases, one may use hierarchical
clustering algorithms (Hastie et al, 2001), i. e. cluster- 
ing techniques that reveal the multilevel structure of the 
graph. Hierarchical clustering is very common in social 
network analysis, biology, engineering, marketing, etc. 

The starting point of any hierarchical clustering 
method is the definition of a similarity measure between 
vertices. After a measure is chosen, one computes the 
similarity for each pair of vertices, no matter if they are 
connected or not. At the end of this process, one is left 
with a new n × n matrix X, the similarity matrix. In Section
III.B.4 we have listed several possible definitions of
similarity. Hierarchical clustering techniques aim at iden- 
tifying groups of vertices with high similarity, and can be 
classified in two categories: 

1. Agglomerative algorithms, in which clusters are it- 
eratively merged if their similarity is sufficiently 
high; 

2. Divisive algorithms, in which clusters are iteratively 
split by removing edges connecting vertices with 
low similarity. 

The two classes refer to opposite processes: agglomera- 
tive algorithms are bottom-up, as one starts from the ver- 
tices as separate clusters (singletons) and ends up with 



the graph as a unique cluster; divisive algorithms are 
top-down as they follow the opposite direction. Divisive 
techniques have been rarely used in the past (meanwhile 
they have become more popular, see Section V), so we 
shall concentrate here on agglomerative algorithms. 

Since clusters are merged based on their mutual sim- 
ilarity, it is essential to define a measure that estimates 
how similar clusters are, out of the matrix X. This in- 
volves some arbitrariness and several prescriptions exist. 
In single linkage clustering, the similarity between two 
groups is the minimum element Xij, with i in one group 
and j in the other. On the contrary, the maximum el- 
ement Xij for vertices of different groups is used in the 
procedure of complete linkage clustering. In average link- 
age clustering one has to compute the average of the Xij . 
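A naive agglomerative procedure with the three linkage rules can be sketched as follows. As an explicit assumption, the entries of the matrix are treated here as dissimilarities (distances), so that at each step the two clusters with the smallest linkage value are merged and single linkage takes the minimum entry; with a similarity matrix the roles of minimum and maximum would be exchanged. The function name, the stopping rule on the number of clusters and the toy data are illustrative.

```python
def agglomerate(dist, n_clusters, linkage="single"):
    """Merge clusters bottom-up until n_clusters are left; return clusters and merges."""
    combine = {
        "single": min,                                   # closest pair of members
        "complete": max,                                 # farthest pair of members
        "average": lambda vals: sum(vals) / len(vals),   # mean over all pairs
    }[linkage]
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                vals = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                score = combine(vals)
                if best is None or score < best[0]:
                    best = (score, a, b)
        score, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), score))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], merges

# Hypothetical data: six points on a line, in two well separated groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]
for rule in ("single", "complete", "average"):
    print(rule, agglomerate(dist, 2, rule)[0])
```

The list of merges, read from the first to the last, is the information encoded in a dendrogram. On this well separated example the three rules agree on the final two clusters, although the order of the intermediate merges differs.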

The procedure can be better illustrated by means of 
dendrograms (Section III.C.1), like the one in Fig. 8.
Sometimes, stopping conditions are imposed to select a 
partition or a group of partitions that satisfy a special 
criterion, like a given number of clusters or the optimiza- 
tion of a quality function (e.g. modularity). 

Hierarchical clustering has the advantage that it does 
not require a preliminary knowledge on the number and 
size of the clusters. However, it does not provide a way 
to discriminate between the many partitions obtained by 
the procedure, and to choose that or those that better 
represent the community structure of the graph. The 
results of the method depend on the specific similarity 
measure adopted. The procedure also yields a hierarchi- 
cal structure by construction, which is rather artificial 
in most cases, since the graph at hand may not have a 
hierarchical structure at all. Moreover, vertices of a com- 
munity may not be correctly classified, and in many cases 
some vertices are missed even if they have a central role 
in their clusters (Newman, 2004a). Another problem is 
that vertices with just one neighbor are often classified 
as separated clusters, which in most cases does not make 
sense. Finally, a major weakness of agglomerative hier- 
archical clustering is that it does not scale well. If points 
are embedded in space, so that one can use the distance 
as dissimilarity measure, the computational complexity 
is $O(n^2)$ for single linkage, $O(n^2 \log n)$ for the complete
and average linkage schemes. For graph clustering, where 
a distance is not trivially defined, the complexity can be- 
come much heavier if the calculation of the chosen simi- 
larity measure is costly. 



C. Partitional clustering 

Partitional clustering indicates another popular class 
of methods to find clusters in a set of data points. Here, 
the number of clusters is preassigned, say k. The points 
are embedded in a metric space, so that each vertex is 
a point and a distance measure is defined between pairs 
of points in the space. The distance is a measure of dissimilarity
between vertices. The goal is to separate the
points in k clusters so as to maximize/minimize a given
cost function based on distances between points and/or
from points to centroids, i. e. suitably defined positions in
space. Some of the most used functions are listed below:

• Minimum k-clustering. The cost function here is
the diameter of a cluster, which is the largest distance
between two points of a cluster. The points
are classified such that the largest of the k cluster
diameters is the smallest possible. The idea is to
keep the clusters very "compact".

• k-clustering sum. Same as minimum k-clustering,
but the diameter is replaced by the average distance
between all pairs of points of a cluster.

• k-center. For each cluster i one defines a reference
point $x_i$, the centroid, and computes the maximum
$d_i$ of the distances of each cluster point from
the centroid. The clusters and centroids are
self-consistently chosen so as to minimize the largest
value of $d_i$.

• k-median. Same as k-center, but the maximum distance
from the centroid is replaced by the average
distance.
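The four cost functions listed above can be evaluated for a given assignment as in the sketch below. It only computes the objective for a fixed partition and fixed centroids, it does not search for the optimum. Euclidean distance, the reading of each objective as "the largest per-cluster value" and all names are assumptions made for the illustration.

```python
import math
from itertools import combinations

def max_diameter(points, clusters):
    """Minimum k-clustering objective: the largest cluster diameter."""
    return max(
        max(math.dist(points[i], points[j]) for i in c for j in c)
        for c in clusters
    )

def max_average_pair_distance(points, clusters):
    """k-clustering sum objective: diameter replaced by the average pair distance."""
    worst = 0.0
    for c in clusters:
        pairs = list(combinations(c, 2))
        if pairs:
            avg = sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)
            worst = max(worst, avg)
    return worst

def k_center_cost(points, clusters, centroids):
    """k-center objective: the largest point-to-centroid distance."""
    return max(
        max(math.dist(points[i], x) for i in c)
        for c, x in zip(clusters, centroids)
    )

def k_median_cost(points, clusters, centroids):
    """k-median objective: as k-center, with the average distance per cluster."""
    return max(
        sum(math.dist(points[i], x) for i in c) / len(c)
        for c, x in zip(clusters, centroids)
    )

# Hypothetical data: two clusters of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 4)]
clusters = [[0, 1], [2, 3]]
centroids = [(0, 1), (10, 2)]
print(max_diameter(points, clusters), k_center_cost(points, clusters, centroids))
```

In this example the second cluster is the less compact one, so it determines every cost: its diameter is 4 and its points lie at distance 2 from the centroid.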

The most popular partitional technique in the literature
is k-means clustering (MacQueen, 1967). Here the cost
function is the total intra-cluster distance, or squared
error function

$$ \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \| \mathbf{x}_j - \mathbf{c}_i \|^2, \tag{23} $$

where $S_i$ indicates the subset of points of the i-th cluster
and $\mathbf{c}_i$ its centroid. The k-means problem can be
simply solved with Lloyd's algorithm (Lloyd, 1982).
One starts from an initial distribution of centroids such 
that they are as far as possible from each other. In the 
first iteration, each vertex is assigned to the nearest cen- 
troid. Next, the centers of mass of the k clusters are 
estimated and become a new set of centroids, which al- 
lows for a new classification of the vertices, and so on. 
After a small number of iterations, the positions of the 
centroids are stable, and the clusters do not change any 
more. The solution found is not optimal, and it strongly 
depends on the initial choice of the centroids. Neverthe- 
less, Lloyd's heuristic has remained popular due to its 
quick convergence, which makes it suitable for the anal- 
ysis of large data sets. The result can be improved by 
performing more runs starting from different initial con- 
ditions, and picking the solution which yields the mini- 
mum value of the total intra-cluster distance. Extensions 
of k-means clustering to graphs have been proposed by
some authors (Hlaoui and Wang, 2004; Rattigan et al.,
2007; Schenker et al., 2003).
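The two alternating steps of Lloyd's heuristic can be sketched as follows. The sketch assumes points given as coordinate tuples, Euclidean distance and initial centroids supplied by the caller; the handling of empty clusters (the centroid is left where it was) is an illustrative choice, as are the names and the data.

```python
import math

def lloyd(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids are stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the center of mass of its cluster.
        new = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new.append(old)  # assumption: keep the centroid of an empty cluster
        if new == centroids:
            break
        centroids = new
    return labels, centroids

def squared_error(points, labels, centroids):
    """Total intra-cluster distance (squared error function)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical data: two well separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = lloyd(points, [(0, 0), (11, 10)])
print(labels, centroids, squared_error(points, labels, centroids))
```

With initial centroids placed far from each other, as suggested above, the procedure converges in two iterations on this example. Running it from several initial conditions and keeping the solution with the smallest squared error is a simple wrapper around `lloyd`.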

Another popular technique, similar in spirit to k-means
clustering, is fuzzy k-means clustering (Bezdek, 1981;
Dunn, 1974). This method accounts for the fact that
a point may belong to two or more clusters at the same
time and is widely used in pattern recognition. The
associated cost function is

$$ J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^m \, \| \mathbf{x}_i - \mathbf{c}_j \|^2, \tag{24} $$

where $u_{ij}$ is the membership matrix, which measures the
degree of membership of point i (with position $\mathbf{x}_i$) in
cluster j, m is a real number greater than 1 and $\mathbf{c}_j$ is the
center of cluster j

$$ \mathbf{c}_j = \frac{\sum_{i=1}^{n} u_{ij}^m \mathbf{x}_i}{\sum_{i=1}^{n} u_{ij}^m}. \tag{25} $$

The matrix $u_{ij}$ is normalized so that the sum of the memberships
of every point in all clusters yields 1. The membership
$u_{ij}$ is related to the distance of point i from the
center of cluster j, as it is reasonable to assume that the
larger this distance, the lower $u_{ij}$. This can be expressed
by the following relation

$$ u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \frac{\| \mathbf{x}_i - \mathbf{c}_j \|}{\| \mathbf{x}_i - \mathbf{c}_l \|} \right)^{\frac{2}{m-1}}}. \tag{26} $$



The cost function $J_m$ can be minimized by iterating
Eqs. 25 and 26. One starts from some initial guess for
$u_{ij}$ and uses Eq. 25 to compute the centers, which are
then plugged back into Eq. 26, and so on. The process
stops when the corresponding elements of the membership
matrix in consecutive iterations differ from each
other by less than a predefined tolerance. It can be shown
that this procedure indeed delivers a local minimum of
the cost function $J_m$ of Eq. 24. This procedure has the
same problems as Lloyd's algorithm for k-means clustering,
i. e. the minimum is a local minimum, and depends
on the initial choice of the matrix $u_{ij}$.
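The alternation between the center update and the membership update can be sketched as below. As a stated deviation from the description above, the sketch is initialized with a guess for the centers rather than for the membership matrix (the first memberships are then obtained from the membership relation), and a point that coincides with a center is given full membership in it to avoid a division by zero. Names, defaults and data are illustrative.

```python
import math

def fuzzy_kmeans(points, centers, m=2.0, tol=1e-6, max_iter=200):
    """Iterate the center and membership updates until memberships are stable."""
    k, dim = len(centers), len(points[0])

    def memberships(centers):
        u = []
        for p in points:
            d = [math.dist(p, c) for c in centers]
            if any(x == 0.0 for x in d):
                # Assumption: a point sitting on a center belongs fully to it.
                row = [1.0 if x == 0.0 else 0.0 for x in d]
                total = sum(row)
                row = [r / total for r in row]
            else:
                row = [
                    1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1.0)) for l in range(k))
                    for j in range(k)
                ]
            u.append(row)
        return u

    u = memberships(centers)
    for _ in range(max_iter):
        # Center update: membership-weighted center of mass of all points.
        centers = [
            tuple(
                sum(u[i][j] ** m * p[a] for i, p in enumerate(points))
                / sum(u[i][j] ** m for i in range(len(points)))
                for a in range(dim)
            )
            for j in range(k)
        ]
        new_u = memberships(centers)
        change = max(abs(x - y) for ra, rb in zip(u, new_u) for x, y in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return u, centers

# Hypothetical data: two well separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
u, centers = fuzzy_kmeans(points, [(0, 0), (11, 10)])
for row in u:
    print([round(x, 4) for x in row])
```

Each row of the returned matrix sums to 1, and on this example every point has a membership close to 1 in the cluster of its own group and close to 0 in the other.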

The limitation of partitional clustering is the same as 
that of the graph partitioning algorithms; the number of 
clusters must be specified at the beginning, the method 
is not able to derive it. In addition, the embedding in a 
metric space can be natural for some graphs, but rather 
artificial for others. 



D. Spectral clustering 

Let us suppose we have a set of n objects $x_1, x_2, ..., x_n$
with a pairwise similarity function S defined between
them, which is symmetric and non-negative (i. e.,
$S(x_i, x_j) = S(x_j, x_i) \ge 0$, $\forall i, j$). Spectral clustering
includes all methods and techniques that partition
the set into clusters by using the eigenvectors of matrices,
like S itself or other matrices derived from it. In particular,
the objects could be points in some metric space,
or the vertices of a graph. Spectral clustering consists of
a transformation of the initial set of objects into a set of
points in space, whose coordinates are elements of eigenvectors:
the set of points is then clustered via standard
techniques, like k-means clustering (Section IV. C). One
may wonder why it is necessary to cluster the points ob- 
tained through the eigenvectors, when one can directly 
cluster the initial set of objects, based on the similarity 
matrix. The reason is that the change of representation 
induced by the eigenvectors makes the cluster properties 
of the initial data set much more evident. In this way, 
spectral clustering is able to separate data points that 
could not be resolved by applying directly fc-means clus- 
tering, for instance, as the latter tends to deliver convex 
sets of points. 

The first contribution on spectral clustering was a paper
by Donath and Hoffman (Donath and Hoffman, 1973), who used
the eigenvectors of the adjacency matrix for graph
partitions. In the same year, Fiedler (Fiedler, 1973)
realized that from the eigenvector of the second smallest
eigenvalue of the Laplacian matrix it was possible to obtain
a bipartition of the graph with very low cut size, as we
have explained in Section IV.A. For a historical survey see
Ref. (Spielman and Teng, 1996). In this Section we shall
follow the tutorial by von Luxburg (von Luxburg, 2006), with
a focus on spectral graph clustering. The concepts and
methods discussed below apply to both unweighted and
weighted graphs.

The Laplacian is by far the most used matrix in spectral
clustering. In Section A.2 we see that the unnormalized
Laplacian of a graph with k connected components has k zero
eigenvalues. In this case the Laplacian can be written in
block-diagonal form, i.e. the vertices can be ordered in
such a way that the Laplacian displays k square blocks along
the diagonal, with (some) entries different from zero,
whereas all other elements vanish. Each block is the
Laplacian of the corresponding subgraph, so it has the
trivial eigenvector with components (1, 1, 1, ..., 1, 1).
Therefore, there are k degenerate eigenvectors with equal
non-vanishing components in correspondence of the vertices
of a block, whereas all other components are zero. In this
way, from the components of the eigenvectors one can
identify the connected components of the graph. For
instance, let us consider the n × k matrix whose columns are
the k eigenvectors mentioned above. The i-th row of this
matrix is a vector with k components representing vertex i
of the graph. Vectors representing vertices in the same
connected component of the graph coincide, and their tip
lies on one of the axes of a k-dimensional system of
coordinates (i.e. they are all vectors of the form
(0, 0, ..., 0, 1, 0, ..., 0, 0)). So, by drawing the vertex
vectors one would see k distinct points, each on a different
axis, corresponding to the graph components.
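As a small illustration of the block structure just described, the sketch below builds the unnormalized Laplacian of a toy disconnected graph and checks that the indicator vector of each connected component is annihilated by it, i.e. that it is an eigenvector with eigenvalue zero. The graph, the helper names and the use of unnormalized 0/1 indicator vectors are assumptions made for this example only.

```python
from collections import deque

def laplacian(n, edges):
    """Unnormalized Laplacian L = D - A of an undirected graph, as a list of lists."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L

def connected_components(n, edges):
    """Connected components found by breadth-first search."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(sorted(comp))
    return comps

def mat_vec(M, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

# Hypothetical toy graph: a triangle {0, 1, 2} and a single edge {3, 4}.
n = 5
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
L = laplacian(n, edges)
components = connected_components(n, edges)

# One indicator vector per component: 1 on its vertices, 0 elsewhere.
vectors = [[1 if i in comp else 0 for i in range(n)] for comp in components]

# L applied to each indicator vector gives the zero vector (eigenvalue 0).
products = [mat_vec(L, v) for v in vectors]

# Row i of the n x k matrix of eigenvectors is the vertex vector of vertex i.
vertex_vectors = [[v[i] for v in vectors] for i in range(n)]
print(vertex_vectors)
```

Vertices of the same component share the same vertex vector, and each vector points along one coordinate axis, which is the picture described in the text.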

If the graph is connected, but consists of k subgraphs
which are weakly linked to each other, the spectrum of
the unnormalized Laplacian will have one zero eigenvalue,
all others being positive. Now the Laplacian cannot be put
in block-diagonal form: even if one enumerates the vertices
in the order of their cluster memberships (by listing first
the vertices of one cluster, then the vertices of another
cluster, etc.) there will always be some non-vanishing
entries outside of the blocks. However, the lowest $k - 1$
non-vanishing eigenvalues are still close to zero, and the
vertex vectors of the first k eigenvectors should still
enable one to clearly distinguish the clusters in a
k-dimensional space. Vertex vectors corresponding to the
same cluster are now not coincident, in general, but still
rather close to each other. So, instead of k points, one
would observe k groups of points, with the points of each
group localized close to each other and far from the other
groups. Techniques like k-means clustering (Section IV.C)
can then easily recover the clusters.

The scenario we have described is expected from
perturbation theory (Bhatia, 1997; Stewart and Sun, 1990).
In principle all symmetric matrices that can be put in
block-diagonal form have a set of eigenvectors (as many
as the blocks), such that the elements of each eigenvector
are different from zero on the vertices of a block and
zero otherwise, just like the Laplacian. The adjacency
matrix itself has the same property, for example. This is
a necessary condition for the eigenvectors to be
successfully used for graph clustering, but it is not
sufficient. In the case of the Laplacian, for a graph with k
connected components, we know that the eigenvectors
corresponding to the k lowest eigenvalues come each from one
of the components. In the case of the adjacency matrix A
(or of its weighted counterpart W), instead, it may happen
that large eigenvalues refer to the same component.
So, if one takes the eigenvectors corresponding to the k
largest eigenvalues*, some components will be
overrepresented, while others will be absent. Therefore,
using the eigenvectors of A (or W) in spectral graph
clustering is in general not reliable. Moreover, the
elements of the eigenvectors corresponding to the components
should be sufficiently far from zero. To understand why,
suppose that we take a (symmetric, block-diagonal) matrix,
and that one or more elements of one of the eigenvectors
corresponding to the connected components are very close
to zero. If one perturbs the graph by adding edges between
different components, all entries of the perturbed
eigenvectors will become non-zero and some may have values
comparable to the lowest elements of the eigenvectors on the
blocks. Therefore distinguishing vertices of different
components may become a problem, even when the perturbation
is fairly small, and misclassifications are likely. On the
other hand, the non-vanishing elements of the (normalized)
eigenvectors of the unnormalized Laplacian, for instance,
are all equal to $1/\sqrt{n_i}$, where $n_i$ is the number of
vertices in the i-th component. In this way, there is a gap
between the lowest element (here they are all equal for the
same eigenvector) and zero. This holds as well for the
normalized Laplacian $L_{rw}$ (Section A.2). For the other
normalized Laplacian $L_{sym}$ (Section A.2), the non-zero
elements of the eigenvectors corresponding to the connected
components are proportional to the square root of the degree
of the corresponding vertex. So, if degrees are very
different from each other, and especially if there are
vertices with very low degree, some eigenvector elements may
be quite small. As we shall see below, in the context of the
technique by Ng et al. (Ng et al., 2001), a suitable
normalization procedure is introduced to alleviate this
problem.

* Large eigenvalues of the adjacency matrix are the counterpart of
the low eigenvalues of the Laplacian, since $L = D - A$, where D
is the diagonal matrix whose elements are the vertex degrees.

Now that we have explained why the Laplacian matrix 
is particularly suitable for spectral clustering, we proceed 
with the description of three popular methods: unnor- 
malized spectral clustering and two normalized spectral 
clustering techniques, proposed by Shi and Malik (Shi 
and Malik, 1997, 2000) and by Ng et al. (Ng et al., 2001), 
respectively. 

Unnormalized spectral clustering uses the unnormalized
Laplacian L. The inputs are the adjacency matrix
A (W for weighted graphs) and the number k of clusters
to be recovered. The first step consists in computing the
eigenvectors corresponding to the lowest k eigenvalues of
L. Then, one builds the n × k matrix V, whose columns
are the k eigenvectors. The n rows of V are used to
represent the graph vertices in a k-dimensional Euclidean
space, through a Cartesian system of coordinates. The
points are then grouped in k clusters by using k-means
clustering or similar techniques (Section IV.C). Normalized
spectral clustering works in the same way. In the
version by Shi and Malik (Shi and Malik, 1997, 2000),
one uses the eigenvectors of the normalized Laplacian
$L_{rw}$ (Section A.2). In the algorithm by Ng et al. (Ng
et al., 2001) one adopts the normalized Laplacian $L_{sym}$
(Section A.2). Here, however, the matrix V is normalized
by dividing the elements of each row by the Euclidean norm
of that row, obtaining a new matrix U with rows of unit
length, which are then used to represent the vertices in
space, as in the other methods. By doing so, it is much less
likely that eigenvector components for a well-separated
cluster are close to zero, a scenario which would make the
classification of the corresponding vertices problematic, as
we have said above. However, if the graph has some vertices
with low degree, they may still be misclassified.
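The unnormalized variant described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function names, the toy graph (two 4-cliques joined by one edge) and the deterministic farthest-point initialization of the k-means step are choices made for this example, not part of the methods cited in the text.

```python
import numpy as np

def spectral_embedding(A, k):
    """Return the n x k matrix V whose columns are the eigenvectors of the
    unnormalized Laplacian L = D - A with the k lowest eigenvalues."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)          # eigenvalues come in ascending order
    return vecs[:, :k]

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; the start is a deterministic farthest-point
    choice (an assumption of this sketch, any standard k-means would do)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def two_cliques(size):
    """Adjacency matrix of two cliques of `size` vertices joined by one edge."""
    n = 2 * size
    A = np.zeros((n, n))
    A[:size, :size] = 1
    A[size:, size:] = 1
    np.fill_diagonal(A, 0)
    A[size - 1, size] = A[size, size - 1] = 1
    return A

A = two_cliques(4)
V = spectral_embedding(A, k=2)   # rows of V are the vertex vectors
labels = kmeans(V, k=2)
print(labels)
```

For the normalized variants one would replace L by $L_{rw}$ or $L_{sym}$ and, in the scheme of Ng et al., rescale each row of V to unit length before the k-means step.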

Spectral clustering is closely related to graph
partitioning. Relaxed versions of the minimization of ratio
cut and normalized cut (see Section IV.A) can be turned into
spectral clustering problems, by following similar
procedures as in spectral graph partitioning. The measure
to minimize can be expressed in matrix form, obtaining
similar expressions as for the cut size (see Eq. 18),
with index vectors defining the partition of the graph in
groups through the values of their entries. For instance,
for the minimum cut bipartition of Section IV.A, there
is only one index vector s, whose components equal ±1,
where the signs indicate the two clusters. The relaxation
consists in performing the minimization over all possible
vectors s, allowing for real-valued components as well.
This version of the problem is exactly equivalent to
spectral clustering. The relaxed minimization of ratio cut
for a partition in k clusters yields the n k-dimensional
vertex vectors of unnormalized spectral clustering (von
Luxburg, 2006); for normalized cut one obtains the n
k-dimensional vertex vectors of normalized spectral
clustering, with the normalized Laplacian $L_{rw}$ (Shi and
Malik, 1997). The problem is then to turn the resulting
vectors into a partition of the graph, which can be done by
using techniques like k-means clustering, as we have seen
above. However, it is still unclear what the relation is
between the original minimum cut problem over actual
graph partitions and the relaxed version of it, in
particular how close one can come to the real solution via
spectral clustering.
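To make the matrix form of the cut size concrete, the sketch below assumes the standard quadratic expression $R = \frac{1}{4} s^T L s$ for a bipartition encoded by an index vector s with entries ±1 (we take this to be the form of Eq. 18) and checks it against a direct count of the edges crossing the partition. The toy graph and helper names are hypothetical.

```python
def laplacian(n, edges):
    """Unnormalized Laplacian L = D - A as a list of lists."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L

def cut_size_direct(edges, s):
    """Number of edges whose endpoints carry opposite signs in s."""
    return sum(1 for u, v in edges if s[u] != s[v])

def cut_size_quadratic(L, s):
    """Cut size from the quadratic form (1/4) s^T L s."""
    n = len(s)
    return sum(s[i] * L[i][j] * s[j] for i in range(n) for j in range(n)) / 4

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
L = laplacian(6, edges)

s_good = [1, 1, 1, -1, -1, -1]    # splits the graph at the bridge
s_bad = [1, 1, -1, -1, -1, -1]    # cuts through the first triangle

print(cut_size_direct(edges, s_good), cut_size_quadratic(L, s_good))
print(cut_size_direct(edges, s_bad), cut_size_quadratic(L, s_bad))
```

The identity holds because $s^T L s = \sum_{(u,v)} (s_u - s_v)^2$, and each crossing edge contributes 4 to the sum. Letting the entries of s take real values is what turns the combinatorial minimization into an eigenvector problem.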

Random walks on graphs are also related to spectral
clustering. In fact, by minimizing the number of
edges between clusters (properly normalized for measures
like, e.g., ratio cut and normalized cut) one forces
random walkers to spend more time within clusters and to
move more rarely from one cluster to another. In
particular, spectral clustering with the normalized
Laplacian $L_{rw}$ has a natural link with random walks,
because $L_{rw} = I - D^{-1}A$ (Section A.2), where
$D^{-1}A$ is the transfer matrix T. This has interesting
consequences. For instance, Meila and Shi have proven that
the normalized cut for a bipartition equals the total
probability that a random walker moves from one of the
clusters to the other in either direction (Meila and Shi,
2001). In this way, minimizing the normalized cut means
looking for a partition minimizing the probability of
transitions between clusters.
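The relation between normalized cut and random walks can be checked numerically on a small graph. The sketch below assumes the usual definition of normalized cut, cut/vol(A) + cut/vol(B), and reads the "total probability" as the sum of the two conditional probabilities of leaving a cluster in one step, for a walker distributed according to the stationary distribution $\pi_i = d_i / 2m$. Both the reading and the helper names are assumptions of this example.

```python
from fractions import Fraction

def normalized_cut(adj, part):
    """cut/vol(A) + cut/vol(B) for the bipartition (part, complement)."""
    A = set(part)
    cut = sum(1 for u in A for v in adj[u] if v not in A)
    vol_A = sum(len(adj[u]) for u in A)
    vol_B = sum(len(adj[u]) for u in adj if u not in A)
    return Fraction(cut, vol_A) + Fraction(cut, vol_B)

def leave_probability(adj, S):
    """P(walker leaves S in one step | walker is in S) at stationarity."""
    two_m = sum(len(nb) for nb in adj.values())
    pi = {u: Fraction(len(adj[u]), two_m) for u in adj}   # stationary distribution
    mass = sum(pi[u] for u in S)
    # T_uv = 1 / d_u for each neighbour v of u (transfer matrix D^{-1} A)
    flow = sum(pi[u] * Fraction(1, len(adj[u]))
               for u in S for v in adj[u] if v not in S)
    return flow / mass

def crossing_probability(adj, part):
    """Sum of the one-step crossing probabilities in both directions."""
    A = set(part)
    B = set(adj) - A
    return leave_probability(adj, A) + leave_probability(adj, B)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

ncut = normalized_cut(adj, [0, 1, 2])
prob = crossing_probability(adj, [0, 1, 2])
print(ncut, prob)
```

Exact rational arithmetic is used so that the two quantities can be compared without rounding error.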

Spectral clustering requires the computation of the
first k eigenvectors of a Laplacian matrix. If the graph
is large, an exact computation of the eigenvectors is
impractical, as it would require a time $O(n^3)$.
Fortunately there are approximate techniques, like the power
method or Krylov subspace techniques like the Lanczos
method (Golub and Loan, 1989), whose speed depends
on the size of the eigengap $|\lambda_{k+1} - \lambda_k|$,
where $\lambda_k$ and $\lambda_{k+1}$ are the k-th and
(k+1)-th smallest eigenvalues of the matrix. The larger the
eigengap, the faster the convergence. In fact, the existence
of large gaps between pairs of consecutive eigenvalues could
suggest the number of clusters of the graph, a piece of
information which is not delivered by spectral clustering
and which has to be given as input. We know that, for a
disconnected graph with k components, the first k
eigenvalues of the Laplacian matrix (normalized or not) are
zero, whereas the (k+1)-th is non-zero. If the clusters are
weakly connected to each other, one expects that the first k
eigenvalues remain close to zero, and that the (k+1)-th is
clearly different from zero. By reversing this argument, the
number of clusters of a graph could be derived by checking
whether there is an integer k such that the first k
eigenvalues are small and the (k+1)-th is relatively large.
However, when the clusters are very mixed with each other,
it may be hard to identify significant gaps between the
eigenvalues.
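The eigengap argument can be turned into a simple heuristic. The sketch below, assuming NumPy is available, takes the position of the largest gap between consecutive eigenvalues of the unnormalized Laplacian as the estimate of k. Picking the single largest gap over the whole spectrum is a simplification made for this example, and the test graph (three 5-cliques joined in a chain by single edges) is hypothetical.

```python
import numpy as np

def estimate_number_of_clusters(A):
    """Return (k, eigenvalues), where k is the number of eigenvalues lying
    below the largest gap in the spectrum of L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigenvalues = np.linalg.eigvalsh(L)      # sorted in ascending order
    gaps = np.diff(eigenvalues)              # gaps[i] = lambda_{i+2} - lambda_{i+1}
    k = int(np.argmax(gaps)) + 1
    return k, eigenvalues

def chain_of_cliques(n_cliques, size):
    """Cliques of `size` vertices, consecutive cliques joined by one edge."""
    n = n_cliques * size
    A = np.zeros((n, n))
    for c in range(n_cliques):
        lo, hi = c * size, (c + 1) * size
        A[lo:hi, lo:hi] = 1
    np.fill_diagonal(A, 0)
    for c in range(n_cliques - 1):
        u, v = (c + 1) * size - 1, (c + 1) * size
        A[u, v] = A[v, u] = 1
    return A

A = chain_of_cliques(3, 5)
k, eigenvalues = estimate_number_of_clusters(A)
print(k)
print(np.round(eigenvalues[:5], 3))
```

On this graph the three lowest eigenvalues stay close to zero while the fourth is at least 5, so the gap is unambiguous. On graphs with strongly mixed clusters the spectrum has no such clear gap and the estimate becomes unreliable, as noted in the text.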

The last issue we want to point out concerns the choice
of the Laplacian matrix to use in the applications. If the
graph vertices have the same or similar degrees, there
is no substantial difference between the unnormalized
and the normalized Laplacians. If there are big
inhomogeneities among the vertex degrees, instead, the choice
of the Laplacian considerably affects the results. In
general, normalized Laplacians are more promising because
the corresponding spectral clustering techniques implicitly
impose a double optimization on the set of partitions,
such that the intracluster edge density is high and, at
the same time, the intercluster edge density is low. On
the contrary, the unnormalized Laplacian is related to
the intercluster edge density only. Moreover, unnormalized
spectral clustering does not always converge, and
sometimes yields trivial partitions in which one or more
clusters consist of a single vertex. Of the normalized
Laplacians, $L_{rw}$ is more reliable than $L_{sym}$ because
the eigenvectors of $L_{rw}$ corresponding to the lowest
eigenvalues are cluster indicator vectors, i.e., they have
equal non-vanishing entries in correspondence of the
vertices of each cluster, and zero elsewhere, if the
clusters are disconnected. The eigenvectors of $L_{sym}$,
instead, are obtained by (left-) multiplying those of
$L_{rw}$ by the matrix $D^{1/2}$. In this way, eigenvector
components corresponding to vertices of the same cluster are
no longer equal, in general, a complication that may induce
artefacts in the spectral clustering procedure.
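The relation between the eigenvectors of the two normalized Laplacians follows from the identity $L_{sym} D^{1/2} = D^{1/2} L_{rw}$, which can be checked directly. The sketch below assumes the standard definitions $L_{rw} = I - D^{-1}A$ and $L_{sym} = I - D^{-1/2} A D^{-1/2}$ and a graph without isolated vertices; the toy graph and helper names are hypothetical.

```python
import math

def degrees_and_adjacency(n, edges):
    """Degree list and dense adjacency matrix of an undirected graph."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    d = [sum(row) for row in A]
    return d, A

def apply_L_rw(d, A, x):
    """Compute (I - D^{-1} A) x."""
    n = len(x)
    return [x[i] - sum(A[i][j] * x[j] for j in range(n)) / d[i] for i in range(n)]

def apply_L_sym(d, A, x):
    """Compute (I - D^{-1/2} A D^{-1/2}) x."""
    n = len(x)
    return [x[i] - sum(A[i][j] * x[j] / math.sqrt(d[i] * d[j]) for j in range(n))
            for i in range(n)]

# Hypothetical toy graph: a star {0; 1, 2, 3} and a separate edge {4, 5}.
n = 6
d, A = degrees_and_adjacency(n, [(0, 1), (0, 2), (0, 3), (4, 5)])

# Indicator vector of the star: eigenvector of L_rw with eigenvalue 0.
u = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
rw_u = apply_L_rw(d, A, u)

# Its image under D^{1/2} is an eigenvector of L_sym with eigenvalue 0,
# but its entries are no longer equal inside the cluster.
w = [math.sqrt(d[i]) * u[i] for i in range(n)]
sym_w = apply_L_sym(d, A, w)

# The identity L_sym D^{1/2} x = D^{1/2} L_rw x holds for any vector x.
x = [0.3, -1.2, 2.0, 0.5, 1.0, -0.7]
lhs = apply_L_sym(d, A, [math.sqrt(d[i]) * x[i] for i in range(n)])
rhs = [math.sqrt(d[i]) * y for i, y in enumerate(apply_L_rw(d, A, x))]
print(w)
```

In the star, the hub has degree 3 and the leaves degree 1, so the entries of w on the same cluster differ by a factor $\sqrt{3}$.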



V. DIVISIVE ALGORITHMS 

A simple way to identify communities in a graph is to
detect the edges that connect vertices of different
communities and remove them, so that the clusters get
disconnected from each other. This is the philosophy of
divisive algorithms. The crucial point is to find a
property of intercommunity edges that could allow for their
identification. Divisive methods do not introduce
substantial conceptual advances with respect to traditional
techniques, as they just perform hierarchical clustering
on the graph under study (Section IV.B). The main
difference with divisive hierarchical clustering is that
here one removes inter-cluster edges instead of edges
between pairs of vertices with low similarity, and there is
no guarantee a priori that inter-cluster edges connect
vertices with low similarity. In some cases vertices (with
all their adjacent edges) or whole subgraphs may be removed,
instead of single edges. Since these are hierarchical
clustering techniques, it is customary to represent the
resulting partitions by means of dendrograms.



A. The algorithm of Girvan and Newman 

The most popular algorithm is that proposed by Girvan
and Newman (Girvan and Newman, 2002; Newman
and Girvan, 2004). The method is historically important,
because it marked the beginning of a new era in the field
of community detection and opened this topic to
physicists. Here edges are selected according to the values
of measures of edge centrality, estimating the importance of
edges according to some property or process running on
the graph. The steps of the algorithm are:

1. Computation of the centrality for all edges;

2. Removal of the edge with largest centrality: in case
of ties with other edges, one of them is picked at
random;

3. Recalculation of centralities on the running graph;

4. Iteration of the cycle from step 2.

FIG. 10 Edge betweenness is highest for edges connecting
communities. In the figure, the edge in the middle has a much
higher betweenness than all other edges, because all shortest
paths connecting vertices of the two communities run through
it. Reprinted figure with permission from Ref. (Fortunato and
Castellano, 2009). ©2009 by Springer.

Girvan and Newman focused on the concept of betweenness,
which is a variable expressing the frequency of the
participation of edges in a process. They considered
three alternative definitions: geodesic edge betweenness,
random-walk edge betweenness and current-flow edge
betweenness. In the following we shall refer to them as edge
betweenness, random-walk betweenness and current-flow
betweenness, respectively.

Edge betweenness is the number of shortest paths
between all vertex pairs that run along the edge. It is an
extension to edges of the popular concept of site
betweenness, introduced by Freeman in 1977 (Freeman, 1977),
and expresses the importance of edges in processes like
information spreading, where information usually flows
through shortest paths. Historically, edge betweenness
was introduced before site betweenness, in a never
published technical report by Anthonisse (Anthonisse, 1971).
It is intuitive that intercommunity edges have a large
value of the edge betweenness, because many shortest
paths connecting vertices of different communities will
pass through them (Fig. 10). As in the calculation of
site betweenness, if there are two or more geodesic paths
with the same endpoints that run through an edge, the
contribution of each of them to the betweenness of the
edge must be divided by the multiplicity of the paths,
as one assumes that the signal/information propagates
equally along each geodesic path. The betweenness of all
edges of the graph can be calculated in a time that scales
as $O(mn)$, or $O(n^2)$ on a sparse graph, with techniques
based on breadth-first-search (Brandes, 2001; Newman
and Girvan, 2004; Zhou et al., 2006).
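The four steps of the algorithm, with geodesic edge betweenness as the centrality, can be sketched as follows. The betweenness routine is a breadth-first-search accumulation in the spirit of the techniques cited above; it is written for unweighted, undirected graphs without self-loops, and all function names are hypothetical. For brevity the loop stops at the first split and resolves ties by taking the first maximal edge found rather than a random one.

```python
from collections import deque

def edge_betweenness(adj):
    """Geodesic edge betweenness; adj maps each vertex to a set of neighbours.
    Paths between the same endpoints share a total weight of 1."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):        # farthest vertices first
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    # every unordered pair of endpoints was visited from both sides
    return {e: b / 2.0 for e, b in bet.items()}

def components(adj):
    """Connected components as sorted vertex lists."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        queue, comp = deque([s]), []
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(sorted(comp))
    return comps

def girvan_newman_first_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {u: set(nb) for u, nb in adj.items()}    # work on a copy
    start = len(components(adj))
    removed = []
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)                # steps 1 and 3
        if not bet:
            break
        u, v = tuple(max(bet, key=bet.get))        # step 2
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(frozenset((u, v)))
    return removed, components(adj)

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bet = edge_betweenness(graph)
removed, parts = girvan_newman_first_split(graph)
print(bet[frozenset((2, 3))], parts)
```

Continuing the loop until no edges are left, and recording every split, would produce the full hierarchy of partitions.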

In the context of information spreading, one could
imagine that signals flow across random rather than
geodesic paths. In this case the betweenness of an edge
is given by the frequency of the passages across the edge
of a random walker running on the graph (random-walk
betweenness). A random walker moving from a vertex
follows each adjacent edge with equal probability. A pair
of vertices is chosen at random, s and t. The walker starts
at s and keeps moving until it hits t, where it stops. One
computes the probability that each edge was crossed by
the walker, and averages over all possible choices for the
vertices s and t. It is meaningful to compute the net
crossing probability, which is proportional to the net
number of times the walk crossed the edge in one direction,
i.e. with the crossings in the opposite direction
subtracted. In this way one neglects back and forth passages
that are accidents of the random walk and tell nothing about
the centrality of the edge. Calculation of random-walk
betweenness requires the inversion of an n × n matrix
(once), followed by obtaining and averaging the flows for
all pairs of nodes. The first task requires a time
$O(n^3)$, the second $O(mn^2)$, for a total complexity
$O[(m + n)n^2]$, or $O(n^3)$ on a sparse graph.

Current-flow betweenness is defined by considering the
graph a resistor network, with edges having unit
resistance. If a voltage difference is applied between any
two vertices, each edge carries some amount of current, that
can be calculated by solving Kirchhoff's equations. The
procedure is repeated for all possible vertex pairs: the
current-flow betweenness of an edge is the average value
of the current carried by the edge. It is possible to show
that this measure is equivalent to random-walk betweenness,
as the voltage differences and the random walks' net
flows across the edges satisfy the same equations
(Newman, 2005). Therefore, the calculation of current-flow
betweenness has the same complexity $O[(m + n)n^2]$, or
$O(n^3)$ for a sparse graph.
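A direct, unoptimized way to evaluate current-flow betweenness is sketched below, assuming NumPy is available and the graph is connected. Instead of fixing a voltage difference, the sketch injects a unit current at s and extracts it at t, which is a normalization chosen for this example; the potentials are obtained from the pseudoinverse of the Laplacian, and the function name is hypothetical.

```python
import itertools
import numpy as np

def current_flow_edge_betweenness(n, edges):
    """Average absolute current on each edge over all source/sink pairs,
    for a connected network of unit resistors."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1
        L[v, v] += 1
        L[u, v] -= 1
        L[v, u] -= 1
    L_pinv = np.linalg.pinv(L)           # computed once
    total = {e: 0.0 for e in edges}
    pairs = list(itertools.combinations(range(n), 2))
    for s, t in pairs:
        # Kirchhoff's equations L V = e_s - e_t, solved up to a constant shift
        V = L_pinv[:, s] - L_pinv[:, t]
        for u, v in edges:
            # unit resistance: the current equals the voltage drop
            total[(u, v)] += abs(V[u] - V[v])
    return {e: c / len(pairs) for e, c in total.items()}

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
flow = current_flow_edge_betweenness(6, edges)
print(round(flow[(2, 3)], 3))
```

In the toy graph the bridge carries the full unit current for each of the 9 pairs with endpoints in different triangles and no current otherwise, so its average over the 15 pairs is 0.6, the largest value among all edges.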

Calculating edge betweenness is much faster than
current-flow or random-walk betweenness [$O(n^2)$ versus
$O(n^3)$ on sparse graphs]. In addition, in practical
applications the Girvan-Newman algorithm with edge
betweenness gives better results than adopting the other
centrality measures (Newman and Girvan, 2004). Numerical
studies show that the recalculation step 3 of the
Girvan-Newman algorithm is essential to detect meaningful
communities. This introduces an additional factor
m in the running time of the algorithm: consequently,
the edge betweenness version scales as $O(m^2 n)$, or
$O(n^3)$ on a sparse graph. On graphs with strong community
structure, that quickly break into communities, the
recalculation step needs to be performed only within the
connected component including the last removed edge
(or the two components bridged by it if the removal of
the edge splits a subgraph), as the edge betweenness of all
other edges remains the same. This can help save some
computer time, although it is impossible to give estimates
of the gain since it depends on the specific graph at hand.
Nevertheless, the algorithm is quite slow, and applicable
to sparse graphs with up to $n \sim 10000$ vertices, with
current computational resources. In the original version
of Girvan-Newman's algorithm (Girvan and Newman,
2002), the authors had to deal with the whole hierarchy
of partitions, as they had no procedure to say which
partition is the best. In a later refinement (Newman
and Girvan, 2004), they selected the partition with
the largest value of modularity (see Section III.C.2), a
criterion that has been frequently used ever since. The
method can be simply extended to the case of weighted
graphs, by suitably generalizing the edge betweenness.
The betweenness of a weighted edge equals the betweenness
of the edge in the corresponding unweighted graph,
divided by the weight of the edge (Newman, 2004). There
have been countless applications of the Girvan-Newman
method: the algorithm is now integrated in well known
libraries of network analysis programs.

Tyler et al. proposed a modification of the Girvan-Newman
algorithm, to improve the speed of the calculation
(Tyler et al., 2003; Wilkinson and Huberman, 2004).
The gain in speed was required by the analysis of graphs
of gene co-occurrences, which are too large to be analyzed
by the algorithm of Girvan and Newman. Algorithms
computing site/edge betweenness start from any
vertex, taken as center, and compute the contribution to
betweenness from all paths originating at that vertex; the
procedure is then repeated for all vertices (Brandes, 2001;
Newman and Girvan, 2004; Zhou et al., 2006). Tyler et
al. proposed to calculate the contribution to edge
betweenness only from a limited number of centers, chosen
at random, deriving a sort of Monte Carlo estimate.
Numerical tests indicate that, for each connected subgraph,
it suffices to pick a number of centers growing as the
logarithm of the number of vertices of the component. For
a given choice of the centers, the algorithm proceeds just
like that of Girvan and Newman. The stopping criterion
is different, though, as it does not require the
calculation of modularity on the resulting partitions, but
relies on a particular definition of community. According to
such definition, a connected subgraph with $n_0$ vertices is
a community if the edge betweenness of any of its edges
does not exceed $n_0 - 1$. Indeed, if the subgraph consists
of two parts connected by a single edge, the betweenness
value of that edge would be greater than or equal to
$n_0 - 1$, with the equality holding only if one of the two
parts consists of a single vertex. Therefore, the condition
on the betweenness of the edges would exclude such
situations, although other types of cluster structures might
still be compatible with it. In this way, in the method of
Tyler et al., edges are removed until all connected
components of the partition are "communities" in the sense
explained above. The Monte Carlo sampling of the edge
betweenness necessarily induces statistical errors. As a
consequence, the partitions are in general different for
different choices of the set of center vertices. However,
the authors showed that, by repeating the calculation
many times, the method gives good results on a network
of gene co-occurrences (Wilkinson and Huberman, 2004),
with a substantial gain of computer time. The technique
has been also applied to a network of people corresponding
via email (Tyler et al., 2003). In practical examples,
only vertices lying at the boundary between communities
may not be clearly classified, and be assigned sometimes
to a group, sometimes to another. This is actually a nice
feature of the method, as it allows one to identify overlaps
between communities, as well as the degree of membership
of overlapping vertices in the clusters they belong
to. The algorithm of Girvan and Newman, which is
deterministic, is unable to accomplish this†. Another fast
version of the Girvan-Newman algorithm has been proposed
by Rattigan et al. (Rattigan et al., 2007). Here,
a quick approximation of the edge betweenness values
is carried out by using a network structure index, which
consists of a set of vertex annotations combined with a
distance measure (Rattigan et al., 2006). Basically one
divides the graph into regions and computes the distances
of every vertex from each region. In this way Rattigan et
al. showed that it is possible to lower the complexity of
the algorithm to $O(m)$, while keeping a fair accuracy in
the estimate of the edge betweenness values. This version of
the Girvan-Newman algorithm gives good results on the
benchmark graphs proposed by Brandes et al. (Brandes
et al., 2003) (see also Section XV.A), as well as on a
collaboration network of actors and on a citation network.

† It may happen that, at a given iteration, two or more edges of the
graph have the same value of maximal betweenness. In this case
one can pick any of them at random, which may lead in general
to (slightly) different partitions at the end of the computation.

Chen and Yuan have pointed out that counting all possible
shortest paths in the calculation of the edge betweenness
may lead to unbalanced partitions, with communities of very
different size, and proposed to count only
non-redundant paths, i.e. paths whose endpoints are
all different from each other: the resulting betweenness
yields better results than standard edge betweenness for
mixed clusters on the benchmark graphs of Girvan and
Newman (Chen and Yuan, 2006). Holme et al. have used
a modified version of the algorithm in which vertices,
rather than edges, are removed (Holme et al., 2003). A
centrality measure for the vertices, proportional to their
site betweenness, and inversely proportional to their
indegree, is chosen to identify boundary vertices, which
are then iteratively removed with all their edges. This
modification, applied to study the hierarchical
organization of biochemical networks, is motivated by the
need to account for reaction kinetic information, that
simple site betweenness does not include. The indegree of a
vertex is solely used because it indicates the number of
substrates to a metabolic reaction involving that vertex;
for the purpose of clustering the graph is considered
undirected, as usual.



The algorithm of Girvan and Newman is unable to
find overlapping communities, as each vertex is assigned
to a single cluster. Pinney and Westhead have proposed
a modification of the algorithm in which vertices can
be split between communities (Pinney and Westhead,
2006). To do that, they also compute the betweenness
of all vertices of the graph. Unfortunately the values of
edge and site betweenness cannot be simply compared,
due to their different normalization, but the authors
remarked that the two endvertices of an inter-cluster edge
should have similar betweenness values, as the shortest
paths crossing one of them are likely to reach the other
one as well through the edge. So they take the edge with
largest betweenness and remove it only if the ratio of the
betweenness values of its endvertices is between $\alpha$
and $1/\alpha$, with $\alpha = 0.8$. Otherwise, the vertex
with highest betweenness (with all its adjacent edges) is
temporarily removed. When a subgraph is split by vertex or
edge removal, all deleted vertices belonging to that
subgraph are "copied" in each subcomponent, along with all
their edges. Gregory (Gregory, 2007) has proposed a similar
approach, named CONGA (Cluster Overlap Newman-Girvan
Algorithm), in which vertices are split among
clusters if their site betweenness exceeds the maximum
value of the betweenness of the edges. A vertex is split
by assigning some of its edges to one of its duplicates,
and the rest to the other. There are several ways to do
that; Gregory proposed to go for the split that
yields the maximum of a new centrality measure, called
split betweenness, which is the number of shortest paths
that would run between two parts of a vertex if the latter
were split. The method has a worst-case complexity
$O(m^3)$, or $O(n^3)$ on a sparse graph, like the algorithm
of Girvan and Newman. The code can be found at
http://www.cs.bris.ac.uk/~steve/networks/index.html.



B. Other methods 

FIG. 11 Schematic illustration of the edge clustering
coefficient introduced by Radicchi et al. (Radicchi et al., 2004).
The two grey vertices have five and six other neighbors,
respectively. Of the five possible triangles based on the edge
connecting the grey vertices, three are actually there,
yielding an edge clustering coefficient of 3/5. Courtesy of F.
Radicchi.

Another promising track to detect inter-cluster edges
is related to the presence of cycles, i.e. closed
non-intersecting paths, in the graph. Communities are
characterized by a high density of edges, so it is reasonable
to expect that such edges form cycles. On the contrary,
edges lying between communities will hardly be part of
cycles. Based on this intuitive idea, Radicchi et al.
proposed a new measure, the edge clustering coefficient, such
that low values of the measure are likely to correspond
to intercommunity edges (Radicchi et al., 2004). The
edge clustering coefficient generalizes to edges the notion
of clustering coefficient introduced by Watts and Strogatz
for vertices (Watts and Strogatz, 1998) (Fig. 11).
The clustering coefficient of a vertex is the number of
triangles including the vertex divided by the number of
possible triangles that can be formed (Section A.1). The
edge clustering coefficient is defined as

$$C^{(g)}_{ij} = \frac{z^{(g)}_{ij} + 1}{s^{(g)}_{ij}},$$

where i and j are the extremes of the edge, $z^{(g)}_{ij}$ the
number of cycles of length g built upon edge ij and
$s^{(g)}_{ij}$ the possible number of cycles of length g that
one could build based on the existing edges of i, j and their
neighbors. The number of actual cycles in the numerator is
augmented by 1 to enable a ranking among edges without
cycles, which would all yield a coefficient $C^{(g)}_{ij}$
equal to zero, independently of the degrees of the extremes
i and j and their neighbors. Usually, cycles of length
g = 3 (triangles) or 4 are considered. The measure is
(anti)correlated with edge betweenness: edges with low
edge clustering coefficient usually have high betweenness
and vice versa, although the correlation is not perfect.
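For triangles (g = 3) the coefficient is easy to evaluate. The sketch below assumes that the number of possible triangles on the edge ij is $\min(k_i - 1, k_j - 1)$, with $k_i$ and $k_j$ the degrees of the endvertices, and that edges with no possible triangle receive an infinite coefficient so that they are never selected for removal; both conventions are assumptions of this example, as are the function name and the toy graph.

```python
import math

def edge_clustering_coefficient(adj, i, j):
    """Edge clustering coefficient for g = 3; adj maps vertices to neighbour sets."""
    z = len(adj[i] & adj[j])                       # triangles built on edge ij
    s = min(len(adj[i]) - 1, len(adj[j]) - 1)      # possible triangles (assumed form)
    if s == 0:
        return math.inf                            # no triangle can exist on this edge
    return (z + 1) / s

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
coeff = {e: edge_clustering_coefficient(adj, *e) for e in edges}
weakest = min(coeff, key=coeff.get)
print(weakest, coeff[weakest])
```

The bridge between the two triangles belongs to no triangle and gets the lowest coefficient, so it would be the first edge removed by a divisive procedure based on this measure.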
The method works like the algorithm of Girvan and Newman.
At each iteration, the edge with smallest clustering
coefficient is removed, the measure is recalculated,
and so on. If the removal of an edge leads to a split 
of a subgraph in two parts, the split is accepted only 
if both clusters are LS-sets ("strong") or "weak" com- 
munities (see Section III.B.2). The verification of the 
community condition on the clusters is performed on the 
full adjacency matrix of the initial graph. If the condi- 
tion were satisfied only for one of the two clusters, the 
initial subgraph may be a random graph, as it can be 
easily seen that by cutting a random graph à la Erdős
and Rényi in two parts, the larger of them is a strong (or
weak) community with very high probability, whereas the 
smaller part is not. Enforcing the community condition 
on both clusters, it is more likely that the subgraph to
be split indeed has a cluster structure. Therefore, the
algorithm stops when all clusters produced by the edge
removals are communities in the strong or weak sense, and
further splits would violate this condition. The authors 
suggested using the same stopping criterion for the
algorithm of Girvan and Newman, to get structurally
well-defined clusters. Since the edge clustering coefficient is a
local measure, involving at most an extended neighbor- 
hood of the edge, it can be calculated very quickly. The 
running time of the algorithm to completion is $O(m^4/n^2)$,
or $O(n^2)$ on a sparse graph, if g is small, so it is much
shorter than the running time of the Girvan-Newman 
method. The recalculation step becomes slow if g is not 
so small, as in this case the number of edges whose co- 
efficient needs to be recalculated may reach a sizeable 
fraction of the edges of the graph; likewise, counting the 
number of cycles based on one edge becomes lengthier. 
If $g \sim 2d$, where d is the diameter of the graph (which
is usually a small number for real networks), the cycles
span the whole graph and the measure becomes global
and no longer local. The computational complexity in
this case exceeds that of the algorithm of Girvan and 
Newman, but it can come close to it for practical pur- 
poses even at lower values of g. So, by tuning g one can 
smoothly interpolate between a local and a global cen- 
trality measure. The software of the algorithm can be 
found in http://filrad.homelinux.org/Data/. In a 
subsequent paper (Castellano et al., 2004) the authors
extended the method to the case of weighted networks,
by modifying the edge clustering coefficient of Eq. 27,
in that the number of cycles $z_{ij}^{(g)}$ is multiplied by the
weight of the edge ij. The definitions of strong and 
weak communities can be trivially extended to weighted 
graphs by replacing the internal/external degrees of the 
vertices/clusters with the corresponding strengths. More 
recently, the method has been extended to bipartite net- 
works (Zhang et al., 2007), where only cycles of even 
length are possible (g = 4, 6, 8, etc.). The algorithm by
Radicchi et al. may give poor results when the graph has
few cycles, as happens in some social and many
non-social networks. In this case, in fact, the edge clustering
coefficient is small and fairly similar for most edges, and
the algorithm may fail to identify the bridges between 
communities. 
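
As a minimal sketch of the triangle-based case (g = 3) of the edge clustering coefficient of Eq. 27, the Python function below counts the triangles built on an edge and divides by the number of possible ones. Two points are assumptions that the text above does not spell out: the denominator is taken as min(k_i - 1, k_j - 1), which is the form commonly quoted for g = 3, and an edge ending in a leaf is given an infinite coefficient so that it is never removed first. The names `edge_clustering_coefficient`, `adj` and `edges` are illustrative only.

```python
from math import inf


def edge_clustering_coefficient(adj, i, j):
    """Triangle-based (g = 3) edge clustering coefficient of the edge (i, j).

    adj maps every vertex to the set of its neighbours (simple undirected graph).
    """
    if j not in adj[i]:
        raise ValueError("(i, j) is not an edge of the graph")
    triangles = len(adj[i] & adj[j])  # z_ij: each common neighbour closes a triangle
    # Assumed form of s_ij for g = 3: min(k_i - 1, k_j - 1).
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if possible == 0:
        return inf  # assumption: an edge ending in a leaf is never removed first
    return (triangles + 1) / possible


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# The edge with the smallest coefficient is the first candidate for removal.
weakest = min(edges, key=lambda e: edge_clustering_coefficient(adj, *e))
```

On this toy graph the bridge (2, 3) has coefficient 1/2, while every triangle edge has coefficient 2, so the bridge is the first edge to be removed.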

An alternative measure of centrality for edges is in- 
formation centrality. It is based on the concept of ef- 
ficiency (Latora and Marchiori, 2001), which estimates 
how easily information travels on a graph according to 
the length of shortest paths between vertices. The effi- 
ciency of a network is defined as the average of the in- 
verse distances between all pairs of vertices. If the ver- 
tices are "close" to each other, the efficiency is high. The 
information centrality of an edge is the relative varia- 
tion of the efficiency of the graph if the edge is removed. 
In the algorithm by Fortunato et al. (Fortunato et al., 
2004), edges are removed according to decreasing values 
of information centrality. The method is analogous to 
that of Girvan and Newman. Computing the information
centrality of an edge requires the calculation of the
distances between all pairs of vertices, which can be done
with breadth-first search in a time $O(mn)$. So, in order
to compute the information centrality of all edges one
requires a time $O(m^2 n)$. At this point one removes the edge
with the largest value of information centrality and
recalculates the information centrality of all remaining edges
with respect to the running graph. Since the procedure is
iterated until there are no more edges in the network, the
final complexity is $O(m^3 n)$, or $O(n^4)$ on a sparse graph.
The partition with the largest value of modularity is cho- 
sen as most representative of the community structure of 
the graph. The method is much slower than the algo- 
rithm of Girvan and Newman. Partitions obtained with 
both techniques are rather consistent, mainly because in- 
formation centrality has a strong correlation with edge 
betweenness. The algorithm by Fortunato et al. gives
better results when communities are mixed, i. e. with a 
high degree of interconnectedness, but it tends to isolate 
leaf vertices and small loosely bound subgraphs. 
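
The definitions of efficiency and information centrality given above can be sketched in a few lines of Python. This is an illustration for unweighted graphs, not the authors' implementation: distances come from breadth-first search, pairs of vertices that are not connected are assumed to contribute zero to the average of inverse distances, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def efficiency(adj):
    """Average of the inverse distances over all ordered pairs of vertices."""
    n = len(adj)
    total = 0.0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += 1.0 / d  # unreachable pairs are skipped, i.e. count as 0
    return total / (n * (n - 1))


def information_centrality(adj, i, j):
    """Relative drop of the efficiency when the edge (i, j) is removed."""
    before = efficiency(adj)
    pruned = {v: set(nb) for v, nb in adj.items()}
    pruned[i].discard(j)
    pruned[j].discard(i)
    return (before - efficiency(pruned)) / before


# Toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

most_central = max(edges, key=lambda e: information_centrality(adj, *e))
```

Each call to `information_centrality` recomputes all distances, which mirrors the $O(mn)$ cost per edge discussed above; on the toy graph the bridge (2, 3) has the largest information centrality and would be removed first.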

A measure of vertex centrality based on loops, similar 
to the clustering coefficient by Watts and Strogatz (Watts 
and Strogatz, 1998), has been introduced by Vragovic 
and Louis (Vragovic and Louis, 2006). The idea is that 
neighbors of a vertex well inside a community are "close" 
to each other, even in the absence of the vertex, due to 
the high density of intra-cluster edges. Suppose that j 
and k are neighbors of a vertex i: $d_{jk/i}$ is the length of
a shortest path between j and k, if i is removed from
the graph. Naturally, the existence of alternative paths
to j-i-k implies the existence of loops in the graph.
Vragovic and Louis defined the loop coefficient of i as
the average of $1/d_{jk/i}$ over all pairs of neighbors of i,
somewhat reminiscent of the concept of information
centrality used in the method by Fortunato et al. (Fortunato
et al., 2004). High values of the loop coefficient are likely 
to identify core vertices of communities, whereas low val- 
ues correspond to vertices lying at the boundary between 
communities. Clusters are built around the vertices with 
highest values of the loop coefficient. The method has 
time complexity $O(nm)$; its results are less accurate
than those of popular clustering techniques.



VI. MODULARITY-BASED METHODS 

Newman-Girvan modularity Q (Section III.C.2), orig- 
inally introduced to define a stopping criterion for the 
algorithm of Girvan and Newman, has rapidly become 
an essential element of many clustering methods. Mod- 
ularity is by far the most used and best known qual- 
ity function. It represented one of the first attempts to 
achieve a first principle understanding of the clustering 
problem, and it embeds in its compact form all essential 
ingredients and questions, from the definition of commu- 
nity, to the choice of a null model, to the expression of the 
"strength" of communities and partitions. In this section 
we shall focus on all clustering techniques that require
modularity, directly and/or indirectly. We will examine
fast techniques that can be used on large graphs, but
which do not find good optima for the measure (Blondel
et al., 2008; Clauset et al., 2004; Danon et al., 2006;
Du et al., 2007; Mei et al., 2009; Newman, 2004b; Noack
and Rotta, 2009; Pujol et al., 2006; Schuetz and Caflisch,
2008a,b; Wakita and Tsurumi, 2007; Xiang et al., 2009);
more accurate methods, which are computationally
demanding (Guimerà et al., 2004; Massen and Doye, 2005;
Medus et al., 2005); algorithms giving a good tradeoff
between high accuracy and low complexity (Duch and
Arenas, 2005; Lehmann and Hansen, 2007; Newman, 2006b;
Ruan and Zhang, 2007; White and Smyth, 2005). We
shall also point out other properties of modularity, dis- 
cuss some extensions/modifications of it, as well as high- 
light its limits. 



A. Modularity optimization 

By assumption, high values of modularity indicate 
good partitions*. So, the partition corresponding to its
maximum value on a given graph should be the best, or 
at least a very good one. This is the main motivation 
for modularity maximization, by far the most popular 
class of methods to detect communities in graphs. An 
exhaustive optimization of Q is impossible, due to the 
huge number of ways in which it is possible to partition 
a graph, even when the latter is small. Besides, the true 
maximum is out of reach, as it has been recently proved 
that modularity optimization is an NP-complete
problem (Brandes et al., 2006), so it is probably impossible
to find the solution in a time growing polynomially with 
the size of the graph. However, there are currently sev- 
eral algorithms able to find fairly good approximations 
of the modularity maximum in a reasonable time. 



1. Greedy techniques 

The first algorithm devised to maximize modularity 
was a greedy method of Newman (Newman, 2004b). It 
is an agglomerative hierarchical clustering method, where 
groups of vertices are successively joined to form larger 
communities such that modularity increases after the 
merging. One starts from n clusters, each containing 
a single vertex. Edges are not initially present, they are 
added one by one during the procedure. However, the 
modularity of partitions explored during the procedure 
is always calculated from the full topology of the graph, 
as we want to find the modularity maximum on the space 
of partitions of the full graph. Adding a first edge to the 
set of disconnected vertices reduces the number of groups 
from n to n - 1, so it delivers a new partition of the graph.

* This is not true in general, as we shall discuss in Section VI.C.

The edge is chosen such that this partition gives the
maximum increase (minimum decrease) of modularity with
respect to the previous configuration. All other edges 
are added based on the same principle. If the insertion 
of an edge does not change the partition, i. e. the edge 
is internal to one of the clusters previously formed, mod- 
ularity stays the same. The number of partitions found 
during the procedure is n, each with a different number 
of clusters, from n to 1. The largest value of modularity 
in this subset of partitions is the approximation of the 
modularity maximum given by the algorithm. At each 
iteration step, one needs to compute the variation $\Delta Q$ of
modularity given by the merger of any two communities
of the running partition, so that one can choose the best
merger. However, merging communities between which
there are no edges can never lead to an increase of Q,
so one has to check only the pairs of communities which
are connected by edges, of which there cannot be more
than m. Since the calculation of each $\Delta Q$ can be done
in constant time, this part of the calculation requires a
time $O(m)$. After deciding which communities are to be
merged, one needs to update the matrix $e_{ij}$ expressing
the fraction of edges between clusters i and j of the
running partition (necessary to compute Q), which can be
done in a worst-case time $O(n)$. Since the algorithm
requires n - 1 iterations (community mergers) to run to
completion, its complexity is $O((m + n)n)$, or $O(n^2)$ on
a sparse graph, so it enables one to perform a clustering 
analysis on much larger networks than the algorithm of 
Girvan and Newman (up to an order of 100,000 vertices
with current computers). In a later paper (Clauset et al.,
2004), Clauset et al. pointed out that the update of the
matrix $e_{ij}$ in Newman's algorithm involves a large number
of useless operations, due to the sparsity of the adjacency
matrix. This operation can be performed more efficiently
by using data structures for sparse matrices, like
max-heaps, which rearrange the data in the form of binary
trees. Clauset et al. maintained the matrix of modularity
variations $\Delta Q_{ij}$, which is also sparse, a max-heap
containing the largest elements of each row of the matrix
$\Delta Q_{ij}$ as well as the labels of the corresponding
communities, and a simple array whose elements are the sums
of the elements of each row of the old matrix $e_{ij}$. The
optimization of modularity can be carried out using these 
three data structures, whose update is much quicker than 
in Newman's technique. The complexity of the algorithm 
is $O(md \log n)$, where d is the depth of the dendrogram
describing the successive partitions found during the
execution of the algorithm, which grows as log n for graphs
with a strong hierarchical structure. For those graphs,
the running time of the method is then $O(n \log^2 n)$, which
allows one to analyse the community structure of very large
graphs, up to $10^6$ vertices. The greedy optimization of
Clauset et al. is currently one of the few algorithms that
can be used to estimate the modularity maximum on such
large graphs. The code can be freely downloaded from
http://cs.unm.edu/~aaron/research/fastmodularity.htm.
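
The agglomerative procedure described above can be sketched as follows. This is a deliberately naive version, closer to the original $O((m+n)n)$ scheme than to the heap-based variant of Clauset et al.: at every step it scans all pairs of connected communities and merges the pair with the largest $\Delta Q = 2(e_{ij} - a_i a_j)$, where $a_i$ is the fraction of edge ends attached to community i. The function name, the dictionary layout and the assumption of a simple graph without self-loops are illustrative choices, not part of the original method.

```python
def greedy_modularity(n, edges):
    """Naive greedy agglomeration on vertices 0..n-1.

    Returns (best modularity found, corresponding partition as a list of sets).
    Assumes at least one edge and no self-loops.
    """
    m = len(edges)
    half = 1.0 / (2 * m)
    e = {c: {} for c in range(n)}   # e[c][d]: half the fraction of edges joining c and d
    a = {c: 0.0 for c in range(n)}  # a[c]: fraction of edge ends attached to c
    for u, v in edges:
        e[u][v] = e[u].get(v, 0.0) + half
        e[v][u] = e[v].get(u, 0.0) + half
        a[u] += half
        a[v] += half
    members = {c: {c} for c in range(n)}
    q = -sum(x * x for x in a.values())  # modularity of the all-singletons partition
    best_q, best = q, [set(s) for s in members.values()]
    while len(members) > 1:
        # Only pairs of communities joined by at least one edge are examined.
        candidates = [(2 * (w - a[c] * a[d]), c, d)
                      for c in e for d, w in e[c].items() if c < d]
        if not candidates:
            break  # the remaining communities are disconnected from each other
        dq, c, d = max(candidates)
        for k, w in e.pop(d).items():  # merge community d into community c
            del e[k][d]
            if k != c:
                e[c][k] = e[c].get(k, 0.0) + w
                e[k][c] = e[k].get(c, 0.0) + w
        a[c] += a.pop(d)
        members[c] |= members.pop(d)
        q += dq
        if q > best_q:
            best_q, best = q, [set(s) for s in members.values()]
    return best_q, best


# Toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
best_q, best_partition = greedy_modularity(6, edges)
```

On the toy graph the dendrogram is cut at the partition into the two triangles, with Q = 5/14; the final merger into a single cluster lowers Q back to zero, which is why the best value along the whole sequence of partitions is recorded.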



This greedy optimization of modularity tends to form
large communities quickly at the expense of small ones,
which often yields poor values of the modularity maxima.
Danon et al. suggested normalizing the modularity
variation $\Delta Q$ produced by the merger of two communities
by the fraction of edges incident to one of the two
communities, in order to favor small clusters (Danon
et al., 2006). This trick leads to better modularity
optima as compared to the original recipe of Newman,
especially when communities are very different in size.
Wakita and Tsurumi (Wakita and Tsurumi, 2007) have 
noticed that, due to the bias towards large communities, 
the fast algorithm by Clauset et al. is inefficient, because 
it yields very unbalanced dendrograms, for which the 
relation $d \sim \log n$ does not hold, and as a consequence
the method often runs at its worst-case complexity. To
improve the situation they proposed a modification in
which, at each step, one seeks the community merger
delivering the largest value of the product of the modularity
variation $\Delta Q$ times a factor (consolidation ratio)
that peaks for communities of equal size. In this way
there is a tradeoff between the gain in modularity and
the balance of the communities to merge, with a big gain
in the speed of the procedure, which enables the analysis
of systems with up to $10^7$ vertices. Interestingly, this
modification often leads to better modularity maxima
than those found with the version of Clauset et al., at
least on large social networks. The code can be found at
http://www.is.titech.ac.jp/~wakita/en/software/community-analysis-software/.
Another trick to
avoid the formation of large communities was proposed 
by Schuetz and Caflisch and consists in allowing for the 
merger of more community pairs, instead of one, at each 
iteration (Schuetz and Caflisch, 2008a,b). This generates 
several "centers" around which communities are formed, 
which grow simultaneously so that a condensation into a 
few large clusters is unlikely. This modified version of the 
greedy algorithm is combined with a simple refinement 
procedure in which single vertices are moved to the neigh- 
boring community that yields the maximum increase of 
modularity. The method has the same complexity as
the fast optimization by Clauset et al., but comes closer
to the modularity maximum. The software is available at
http://www.biochem-caflisch.uzh.ch/public/5/network-clusterization-algorithm.html.
The accuracy of the greedy optimization can be significantly
improved if the hierarchical agglomeration is started 
from some reasonable intermediate configuration, rather 
than from the individual vertices (Du et al., 2007;
Pujol et al., 2006). Xiang et al. suggested starting
from a configuration obtained by merging the original
isolated vertices into larger subgraphs, according to the
values of a measure of topological similarity between
subgraphs (Xiang et al., 2009). A similar approach has
been described by Ye et al. (Ye et al., 2008): here the
initial partition is such that no single vertex can be
moved from its cluster to another without decreasing
Q. Higher-quality modularities can also be achieved by
applying refinement strategies based on local search at
various steps of the greedy agglomeration (Noack and
Rotta, 2009). Such refinement procedures are similar
to the technique proposed by Newman to improve
the results of his spectral optimization of modularity
(see (Newman, 2006b) and Section VI.A.4). Another good
strategy consists in alternating greedy optimization with 
stochastic perturbations of the partitions (Mei et al., 
2009). 



A different greedy approach has been introduced by 
Blondel et al. (Blondel et al., 2008), for the general case
of weighted graphs. Initially, all vertices of the graph are 
put in different communities. The first step consists of 
a sequential sweep over all vertices. Given a vertex i, 
one computes the gain in weighted modularity (Eq. 35) 
coming from putting i in the community of its neighbor 
j and picks the community of the neighbor that yields 
the largest increase of Q, as long as it is positive. At the 
end of the sweep, one obtains the first level partition. In 
the second step communities are replaced by superver- 
tices, and two supervertices are connected if there is at 
least an edge between vertices of the corresponding com- 
munities. In this case, the weight of the edge between 
the supervertices is the sum of the weights of the edges 
between the represented communities at the lower level. 
The two steps of the algorithm are then repeated, yield- 
ing new hierarchical levels and supergraphs (Fig. 12). We 
remark that modularity is always computed from the ini- 
tial graph topology: operating on supergraphs enables 
one to consider the variations of modularity for parti- 
tions of the original graph after merging and/or split- 
ting of groups of vertices. Therefore, at some iteration, 
modularity cannot increase anymore, and the algorithm 
stops. The technique is more limited by storage demands 
than by computational time. The latter grows like $O(m)$,
so the algorithm is extremely fast and graphs with up
to $10^9$ edges can be analyzed in a reasonable time on
current computational resources. The software can be
found at http://findcommunities.googlepages.com/.
The modularity maxima found by the method are better
than those found with the greedy techniques by
Clauset et al. (Clauset et al., 2004) and Wakita and
Tsurumi (Wakita and Tsurumi, 2007). However, clos- 
ing communities within the immediate neighborhood of 
vertices may be inaccurate and yield spurious partitions 
in practical cases. So, it is not clear whether some of the 
intermediate partitions could correspond to meaningful 
hierarchical levels of the graph. Moreover, the results 
of the algorithm depend on the order of the sequential 
sweep over the vertices. 
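
The first step of the method of Blondel et al., the sequential sweep over the vertices, can be sketched as below. The sketch covers only this local-moving step on one hierarchical level; the construction of supervertices is omitted. It uses the fact that, for an isolated vertex v of strength $k_v$ joining a community C of total strength $\Sigma_C$, the modularity gain is proportional to $w_{vC} - \Sigma_C k_v / 2m$, with $w_{vC}$ the total weight of the edges between v and C. A move is accepted only if it beats putting the vertex back where it was, which is how the condition of a positive gain is interpreted here. The function name, the dictionary layout and the small numerical tolerance are assumptions made for the example.

```python
def local_moving_phase(adj_w):
    """Sweep over the vertices until no single move increases modularity.

    adj_w[v] is a dict {neighbour: weight}; no self-loops are assumed.
    Returns a dict mapping each vertex to a community label.
    """
    strength = {v: sum(nb.values()) for v, nb in adj_w.items()}
    two_m = sum(strength.values())
    comm = {v: v for v in adj_w}  # every vertex starts in its own community
    tot = dict(strength)          # total strength of each community
    moved = True
    while moved:
        moved = False
        for v in adj_w:  # the outcome depends on this sweep order
            old = comm[v]
            tot[old] -= strength[v]  # take v out of its community
            links = {}               # edge weight from v to each neighbouring community
            for u, w in adj_w[v].items():
                links[comm[u]] = links.get(comm[u], 0.0) + w

            def gain(c):
                # Proportional to the change of Q when the isolated v joins c.
                return links.get(c, 0.0) - tot[c] * strength[v] / two_m

            best, best_gain = old, gain(old)
            for c in links:
                g = gain(c)
                if g > best_gain + 1e-12:
                    best, best_gain = c, g
            comm[v] = best
            tot[best] += strength[v]
            moved = moved or best != old
    return comm


# Toy graph with unit weights: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj_w = {v: {} for v in range(6)}
for u, v in edges:
    adj_w[u][v] = 1.0
    adj_w[v][u] = 1.0

communities = local_moving_phase(adj_w)
```

On the toy graph the sweeps end with the two triangles as separate communities. Iterating over `adj_w` in a different order may change intermediate assignments, which illustrates the dependence on the sweep order mentioned above.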



We conclude by stressing that, despite the improvements
and refinements of the last years, greedy optimization
remains less accurate than other techniques.



2. Simulated annealing 

Simulated annealing (Kirkpatrick et al., 1983) is a
probabilistic procedure for global optimization used in
different fields and problems. It consists in performing
an exploration of the space of possible states, looking for
the global optimum of a function F, say its maximum.
Transitions from one state to another occur with probability
1 if F increases after the change, otherwise with a
probability $\exp(\beta \Delta F)$, where $\Delta F < 0$ is the variation
(decrease) of the function and $\beta$ is an index of stochastic
noise, a sort of inverse temperature, which increases after
each iteration. The noise reduces the risk that the system
gets trapped in local optima. At some stage, the system
converges to a stable state, which can be an arbitrarily good
approximation of the maximum of F, depending on how many
states were explored and how slowly $\beta$ is varied. Simulated
annealing was first employed for modularity optimization
by Guimerà et al. (Guimerà et al., 2004). Its standard
implementation (Guimerà and Amaral, 2005) combines
two types of "moves" : local moves, where a single vertex 
is shifted from one cluster to another, taken at random; 
global moves, consisting of mergers and splits of com- 
munities. Splits can be carried out in several distinct 
ways. The best performance is achieved if one optimizes 
the modularity of a bipartition of the cluster, taken as 
an isolated graph. This is done again with simulated an- 
nealing, where one considers only individual vertex move- 
ments, and the temperature is decreased until it reaches 
the running value for the global optimization. Global 
moves reduce the risk of getting trapped in local optima
and they have proven to lead to much better optima
than using simply local moves (Massen and Doye, 2005;
Medus et al., 2005). In practical applications, one typically
combines $n^2$ local moves with n global ones in one
iteration. The method can potentially come very close
to the true modularity maximum, but it is slow. The
actual complexity cannot be estimated, as it heavily
depends on the parameters chosen for the optimization
(initial temperature, cooling factor), not only on the graph
size. Simulated annealing can be used for small graphs,
with up to about $10^4$ vertices.
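
The acceptance rule at the heart of simulated annealing can be illustrated with a generic maximization loop. This is a sketch of the general procedure, not the implementation of Guimerà and Amaral: it has no distinction between local and global moves, the inverse temperature $\beta$ is simply multiplied by a constant factor after every step, and the names `anneal`, `propose`, `score`, `growth` and the default parameter values are hypothetical.

```python
import math
import random


def anneal(state, propose, score, beta=1.0, growth=1.01, steps=2000, seed=0):
    """Maximise score(state) by simulated annealing; returns the best state seen."""
    rng = random.Random(seed)
    current, f_current = state, score(state)
    best, f_best = current, f_current
    for _ in range(steps):
        candidate = propose(current, rng)
        delta = score(candidate) - f_current  # Delta F of the proposed move
        # Accept with probability 1 if F does not decrease, else exp(beta * Delta F).
        if delta >= 0 or rng.random() < math.exp(beta * delta):
            current, f_current = candidate, f_current + delta
            if f_current > f_best:
                best, f_best = current, f_current
        beta *= growth  # the inverse temperature increases after each iteration
    return best


# Toy problem: maximise F(x) = -(x - 3)^2 over the integers, moving by +1 or -1.
best_x = anneal(0,
                propose=lambda x, rng: x + rng.choice((-1, 1)),
                score=lambda x: -(x - 3) ** 2)
```

For modularity optimization, `state` would be a partition, `score` the modularity Q and `propose` a local or global move; with the toy settings shown here the search settles on x = 3.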



3. Extremal optimization 

Extremal optimization (EO) is a heuristic search pro- 
cedure proposed by Boettcher and Percus (Boettcher and 
Percus, 2001), in order to achieve an accuracy comparable
with simulated annealing, but with a substantial gain
in computer time. It is based on the optimization of local
variables, expressing the contribution of each unit of the
system to the global function under study. This technique
was used for modularity optimization by Duch and Are- 
nas (Duch and Arenas, 2005). Modularity can be indeed 
written as a sum over the vertices: the local modularity 
of a vertex is the value of the corresponding term in this 
sum.

FIG. 12 Hierarchical optimization of modularity by Blondel et al. (Blondel et al., 2008). The diagram shows two iterations of
the method, starting from the graph on the left. Each iteration consists of a step, in which every vertex is assigned to the (local)
cluster that produces the largest modularity increase, followed by a successive transformation of the clusters into vertices of a
smaller (weighted) graph, representing the next higher hierarchical level. Reprinted figure with permission from Ref. (Blondel
et al., 2008). ©2008 by IOP Publishing and SISSA.

A fitness measure for each vertex is obtained by
dividing the local modularity of the vertex by its degree,
as in this case the measure does not depend on the degree 
of the vertex and is suitably normalized. One starts from 
a random partition of the graph in two groups with the 
same number of vertices. At each iteration, the vertex 
with the lowest fitness is shifted to the other cluster. The 
move changes the partition, so the local fitnesses of many 
vertices need to be recalculated. The process continues 
until the global modularity Q cannot be improved any 
more by the procedure. This technique reminds one of 
the Kernighan-Lin (Kernighan and Lin, 1970) algorithm 
for graph partitioning (Section IV. A), but here the sizes 
of the communities are determined by the process itself, 
whereas in graph partitioning they are fixed from the be- 
ginning. After the bipartition, each cluster is considered 
as a graph on its own and the procedure is repeated, as 
long as Q increases for the partitions found. The pro- 
cedure, as described, proceeds deterministically from the 
given initial partition, as one shifts systematically the 
vertex with lowest fitness, and is likely to get trapped 
in local optima. Better results can be obtained if one 
introduces a probabilistic selection, in which vertices are 
ranked based on their fitness values and one picks the 
vertex of rank q with the probability $P(q) \sim q^{-\tau}$ ($\tau$-EO,
(Boettcher and Percus, 2001)). The algorithm finds very
good estimates of the modularity maximum, and performs
very well on the benchmark of Girvan and Newman
(Girvan and Newman, 2002) (Section XV.A). Ranking
the fitness values has a cost $O(n \log n)$, which can be
reduced to $O(n)$ if heap data structures are used. Choosing
the vertex to be shifted can be done with a binary
search, which amounts to an additional factor $O(\log n)$.
Finally, the number of steps needed to verify whether
the running modularity maximum can be improved or
not is also $O(n)$. The total complexity of the method
is then $O(n^2 \log n)$. We conclude that EO represents a
good tradeoff between accuracy and speed, although the 
use of recursive bisectioning may lead to poor results on 
large networks with many communities. 
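
The vertex fitness used in extremal optimization can be made concrete with a short sketch. It assumes the usual decomposition of modularity over vertices, in which the local modularity of vertex i is $q_i = \kappa_i - k_i a_{c(i)}$, with $\kappa_i$ the number of neighbours of i in its own cluster and $a_c$ the fraction of edge ends attached to cluster c, so that $Q = \sum_i q_i / 2m$ and the fitness is $q_i / k_i$. Only the deterministic move (shift the vertex of lowest fitness) is shown; the probabilistic $\tau$-EO selection and the recursive bisection are left out. Function names and the 0/1 cluster labels are assumptions of the example.

```python
def vertex_fitness(adj, group):
    """Fitness of every vertex: local modularity divided by degree.

    adj maps vertices to neighbour sets (no isolated vertices assumed);
    group maps vertices to cluster labels.
    """
    two_m = sum(len(nb) for nb in adj.values())
    ends = {}  # number of edge ends attached to each cluster
    for v, nb in adj.items():
        ends[group[v]] = ends.get(group[v], 0) + len(nb)
    fitness = {}
    for v, nb in adj.items():
        inside = sum(1 for u in nb if group[u] == group[v])
        fitness[v] = inside / len(nb) - ends[group[v]] / two_m
    return fitness


def extremal_step(adj, group):
    """Deterministic EO move for a bipartition labelled 0 and 1.

    Shifts the vertex with the lowest fitness to the other cluster, in place,
    and returns that vertex.
    """
    fitness = vertex_fitness(adj, group)
    worst = min(fitness, key=fitness.get)
    group[worst] = 1 - group[worst]
    return worst


# Toy graph: two triangles joined by the bridge (2, 3); vertex 2 starts misplaced.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
group = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}

start_fitness = vertex_fitness(adj, group)
shifted = extremal_step(adj, group)
```

In the starting bipartition vertex 2 is the only vertex with negative fitness, so the first move shifts it back to the cluster of vertices 0 and 1.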



4. Spectral optimization 

Modularity can be optimized using the eigenvalues and 
eigenvectors of a special matrix, the modularity matrix 
B, whose elements are

$$B_{ij} = A_{ij} - \frac{k_i k_j}{2m}. \qquad (28)$$

Here the notation is the same used in Eq. 14. Let s be
the vector representing any partition of the graph in two
clusters A and B: $s_i = +1$ if vertex i belongs to A,
$s_i = -1$ if i belongs to B. Modularity can be written as

$$Q = \frac{1}{4m} \sum_{ij} B_{ij} s_i s_j = \frac{1}{4m}\, s^T B s. \qquad (29)$$

The last expression indicates standard matrix products.
The vector s can be decomposed on the basis of eigenvectors
$u_i$ ($i = 1, \dots, n$) of the modularity matrix B:
$s = \sum_i a_i u_i$, with $a_i = u_i^T \cdot s$. By plugging this
expression of s into Eq. 29 one finally gets

$$Q = \frac{1}{4m} \sum_i a_i u_i^T B \sum_j a_j u_j = \frac{1}{4m} \sum_{i=1}^{n} (u_i^T \cdot s)^2 \beta_i, \qquad (30)$$

where $\beta_i$ is the eigenvalue of B corresponding to the
eigenvector $u_i$. Eq. 30 is analogous to Eq. 19 for the
cut size of the graph partitioning problem. This sug- 
gests that one can optimize modularity on bipartitions 
via spectral bisection (Section IV. A), by replacing the 
Laplacian matrix with the modularity matrix (Newman, 
2006a,b). Like the Laplacian matrix, B has always the 
trivial eigenvector (1,1,..., 1) with eigenvalue zero, be- 
cause the sum of the elements of each row/column of 
the matrix vanishes. From Eq. 30 we see that, if B has 
no positive eigenvalues, the maximum coincides with the 
trivial partition consisting of the graph as a single cluster 
(for which Q = 0), i. e. it has no community structure. 
Otherwise, one has to look for the eigenvector of B with 
largest (positive) eigenvalue, $u_1$, and group the vertices
according to the signs of the components of $u_1$, just like
in Section IV.A. Here, however, one does not need to
specify the sizes of the two groups: the vertices with
positive components are all in one group, the others in the
other group. If, for example, the component of $u_1$
corresponding to vertex i is positive, but we set $s_i = -1$,
the modularity is lower than by setting $s_i = +1$. The
values of the components of $u_1$ are also informative, as
they indicate the level of the participation of the vertices
to their communities. In particular, vertices whose
components are close to zero lie at the border between
the two clusters and can be well considered as belonging
to both of them. The result obtained from the spectral
bipartition can be further improved by shifting single
vertices from one community to the other, so as to obtain the
highest increase (or lowest decrease) of the modularity of
the resulting graph partition. This refinement technique,
similar to the Kernighan-Lin algorithm (Section IV. A), 
can be also applied to improve the results of other op- 
timization techniques (e.g. greedy algorithms, extremal 
optimization, etc.). The procedure is repeated for each
of the clusters separately, and the number of communities
increases as long as modularity does. At variance
with graph partitioning, where one needs to fix the num- 
ber of clusters and their size beforehand, here there is a 
clear-cut stopping criterion, represented by the fact that 
cluster subdivisions are admitted only if they lead to a 
modularity increase. We stress that modularity needs 
to be always computed from the full adjacency matrix 
of the original graph**. The drawback of the method
is the same as for spectral bisection, i. e. the algorithm
gives the best results for bisections, whereas it is less
accurate when the number of communities is larger than
two. Recently, Sun et al. (Sun et al., 2009) have added a
step after each bipartition of a cluster, in that single ver- 
tices can be moved from one cluster to another and even 
form the seeds of new clusters. We remark that the pro- 
cedure is different from the Kernighan-Lin-like refining 
steps, as here the number of clusters can change. This 
variant, which does not increase the complexity of the 
original spectral optimization, leads to better modular- 
ity maxima. Moreover, one does not need to stick to 
bisectioning, if other eigenvectors with positive eigenval- 
ues of the modularity matrix are used. Given the first p 
eigenvectors, one can construct n p-dimensional vectors, 
each corresponding to a vertex, just like in spectral par- 
titioning (Section IV. D). The components of the vector 
of vertex i are proportional to the p entries of the eigen- 
vectors in position i. Then one can define community 
vectors, by summing the vectors of vertices in the same 
community. It is possible to show that, if the vectors of 
two communities form an angle larger than $\pi/2$, keeping
the communities separate yields larger modularity than if
they are merged (Fig. 13). In this way, in a p-dimensional
space the modularity maximum corresponds to a partition
in at most p + 1 clusters. Community vectors were
used by Wang et al. to obtain high-modularity partitions
into a number of communities smaller than a given
maximum (Wang et al., 2008). In particular, if one takes the
eigenvectors corresponding to the two largest eigenvalues,
one can obtain a split of the graph in three clusters:
in a recent work, Richardson et al. presented a fast
technique to obtain graph tripartitions with large modularity
along these lines (Richardson et al., 2009). The eigenvectors
with the most negative eigenvalues can also be used
to extract useful information, like the presence of a pos- 
sible multipartite structure of the graph, as they give the 
most relevant contribution to the modularity minimum. 

Richardson et al. (Richardson et al., 2009) have actually shown
that if one instead seeks the optimization of modularity for each
cluster, taken as an independent graph, the combination of spectral
bisectioning and the post-processing technique may yield
better results for the final modularity optima.

FIG. 13 Spectral optimization of modularity by Newman
(Newman, 2006a,b). By using the first two eigenvectors
of the modularity matrix, vertices can be represented as
points on a plane. By cutting the plane with a line passing
through the origin (like the dashed line in the figure) one obtains
bipartitions of the graph with possibly high modularity
values. Reprinted figure with permission from Ref. (Newman,
2006a). ©2006 by the American Physical Society.

The spectral optimization of modularity is quite fast.
The leading eigenvector of the modularity matrix can be
computed with the power method, by repeatedly multiplying
B by an arbitrary vector (not orthogonal to $u_1$).
The number of required iterations to reach convergence
is $O(n)$. Each multiplication seems to require a
time $O(n^2)$, as B is a complete matrix, but the peculiar
form of B allows for a much quicker calculation, taking
time $O(m + n)$. So, a graph bipartition requires a time
$O[n(m + n)]$, or $O(n^2)$ on a sparse graph. To find the
modularity optimum one needs a number of subsequent
bipartitions that equals the depth d of the resulting hierarchical
tree. In the worst-case scenario, $d = O(n)$, but in
practical cases the procedure usually stops much before
reaching the leaves of the dendrogram, so one could go
for the average value $\langle d \rangle \sim \log n$, for a total complexity
of $O(n^2 \log n)$. The algorithm is faster than extremal optimization
and it is also slightly more accurate, especially
for large graphs. The modularity matrix and the corresponding
spectral optimization can be trivially extended
to weighted graphs.
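
The O(m + n) cost of one multiplication follows from writing B x = A x - k (k·x)/2m, so that B never has to be stored. The following is a minimal sketch of the power method in Python (NumPy), under assumptions that are not part of the text: the graph is an undirected edge list without self-loops, and a diagonal shift is added because the plain power method converges to the eigenvalue of largest magnitude, which for B may be negative. The function name and the particular shift bound are illustrative choices.

```python
import numpy as np

def leading_modularity_eigenvector(edges, n, iters=1000, tol=1e-10, seed=0):
    """Power method for B = A - k k^T / 2m without ever building B."""
    edges = np.asarray(edges)
    src, dst = edges[:, 0], edges[:, 1]
    k = (np.bincount(src, minlength=n) + np.bincount(dst, minlength=n)).astype(float)
    two_m = k.sum()
    # Assumed shift: an upper bound on the spectral norm of B, so that
    # B + shift*I has no negative eigenvalues and its dominant eigenvector
    # is the leading eigenvector of B.
    shift = k.max() + k @ k / two_m

    def apply_B(x):
        # A x costs O(m); the rank-one correction costs O(n).
        ax = np.zeros(n)
        np.add.at(ax, src, x[dst])
        np.add.at(ax, dst, x[src])
        return ax - k * (k @ x) / two_m

    x = np.random.default_rng(seed).standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = apply_B(x) + shift * x
        y /= np.linalg.norm(y)
        converged = np.linalg.norm(y - x) < tol
        x = y
        if converged:
            break
    return x @ apply_B(x), x  # (eigenvalue estimate, unit eigenvector)

# Two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
eigenvalue, u = leading_modularity_eigenvector(edges, 6)
# The signs of the components of u separate {0, 1, 2} from {3, 4, 5}.
```

On this toy graph the largest and smallest eigenvalues of B have the same magnitude, so the unshifted iteration would not settle; this is the reason for the shift in the sketch.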

A different spectral approach had been previously proposed
by White and Smyth (White and Smyth, 2005).
Let W indicate the weighted adjacency matrix of a
graph $\mathcal{G}$. A partition of $\mathcal{G}$ in k clusters can be described
through an n × k assignment matrix X, where $x_{ic} = 1$ if
vertex i belongs to cluster c, otherwise $x_{ic} = 0$. It can
be easily shown that, up to a multiplicative constant,
modularity can be rewritten in terms of the matrix X as

$$ Q \propto \mathrm{tr}[X^T (\mathcal{W} - \mathcal{D}) X] = -\mathrm{tr}[X^T L_Q X], \tag{31} $$

where $\mathcal{W}$ is a diagonal matrix with identical elements,
equal to the sum of all edge weights, and the entries of
$\mathcal{D}$ are $\mathcal{D}_{ij} = k_i k_j$, where $k_i$ is the degree of vertex i. The
matrix $L_Q = \mathcal{D} - \mathcal{W}$ is called the Q-Laplacian. Finding
the assignment matrix X that maximizes Q is an NP-complete
problem, but one can get a good approximation
by relaxing the constraint that the elements of X have to
be discrete. By doing so Q becomes a sort of continuous
functional of X and one can determine the extremes of Q
by setting its first derivative (with respect to X) to zero.
This leads to the eigenvalue problem

$$ L_Q X = X \Lambda. \tag{32} $$

Here $\Lambda$ is a diagonal matrix. Eq. 32 turns modularity
maximization into a spectral graph partitioning problem 
(Section IV.D), using the Q-Laplacian matrix. A nice
feature of the Q-Laplacian is that, for graphs which are
not too small, it can be approximated (up to constant
factors) by the transition matrix $\tilde{W}$, obtained by normalizing
W such that the sum of the elements of each
row equals one. Eq. 32 is at the basis of the algorithms
developed by White and Smyth, which search for partitions
with at most K clusters, where K is a predefined
input parameter that may be suggested by preliminary
information on the graph cluster structure. The first
K − 1 eigenvectors of the transition matrix $\tilde{W}$ (excluding
the trivial eigenvector with all equal components) can be
computed with a variant of the Lanczos method (Demmel
et al., 2000). Since the eigenvector components are not
integer, the eigenvectors do not correspond directly to a
partition of the graph in clusters. However, as usual in
spectral graph partitioning, the components of the eigenvectors
can be used as coordinates of the graph vertices
in a Euclidean space and k-means clustering is applied
to obtain the desired partition. White and Smyth proposed
two methods to derive the clustering after embedding
the graph in space. Both methods have a worst-case
complexity $O(K^2 n + Km)$, which is essentially linear
in the number of vertices of the graph if the latter
is sparse and $K \ll n$. However, on large and sparse
graphs, K could scale with the number of vertices n,
so the procedure might become quite slow. In order to
speed up calculations without losing much accuracy in 
the final estimates of the maximum modularity, Ruan 
and Zhang (Ruan and Zhang, 2007) have proposed an al- 
gorithm, called Kcut, that applies recursively the method 
by White and Smyth, in a slightly modified form: after 
a first application to the full graph, in the next iteration 
the method is applied to all clusters of the first partition, 
treated as independent networks, and so on. The procedure
goes on as long as the modularity of the resulting
partitions keeps growing. The advantage of Kcut is that
one can play with low values for the (maximal) number
of clusters $\ell$ at each iteration; if partitions are balanced,
after a levels of recursions, the number of clusters of the
partition is approximately $K = \ell^a$. Therefore the complexity
of Kcut is $O[(n + m) \log K]$ for a final partition in
(at most) K clusters, which is much lower than the complexity
of the algorithm by White and Smyth. Ruan and
Zhang tested Kcut on artificial graphs generated with the
planted $\ell$-partition model (Section XV), and on real networks
including Zachary's karate club (Zachary, 1977),
the American college football network (Girvan and New- 
man, 2002) and two collaboration networks of Jazz musi- 
cians (Gleiser and Danon, 2003) and physicists (Newman, 
2001): the accuracy of Kcut is comparable to that of the 
algorithm by White and Smyth, though generally lower. 
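
The rewriting of modularity in terms of the assignment matrix X can be checked numerically. The sketch below is only an illustration under stated assumptions: the graph is unweighted and undirected, and the modularity matrix B = A - k k^T/2m of Section VI.A.4 is used in place of the Q-Laplacian, so that the multiplicative constant is fixed and Q = tr(X^T B X)/2m. The helper names are hypothetical.

```python
import numpy as np

def assignment_matrix(labels, n_clusters):
    """n x k matrix X with X[i, c] = 1 if vertex i belongs to cluster c."""
    X = np.zeros((len(labels), n_clusters))
    X[np.arange(len(labels)), labels] = 1.0
    return X

def modularity_trace(A, X):
    """Q = tr(X^T B X) / 2m with B = A - k k^T / 2m (assumed normalization)."""
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m
    return np.trace(X.T @ B @ X) / two_m

# Two triangles joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

X = assignment_matrix([0, 0, 0, 1, 1, 1], 2)
Q = modularity_trace(A, X)  # 5/14 for this partition
```

Relaxing the entries of X to real values, as described above, is what turns the maximization of this trace into an eigenvalue problem.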



5. Other optimization strategies 

Agarwal and Kempe have suggested to maximize mod- 
ularity within the framework of mathematical program- 
ming (Agarwal and Kempe, 2008). In fact, modularity 
optimization can be formulated both as a linear and as 
a quadratic program. In the first case, the variables are
defined on the links: $x_{ij} = 0$ if i and j are in the same
cluster, otherwise $x_{ij} = 1$. The modularity of a partition,
up to a multiplicative constant, can then be written as

$$ Q \propto \sum_{ij} B_{ij} (1 - x_{ij}), \tag{33} $$

where B is the modularity matrix defined by Newman
(see Section VI.A.4). Eq. 33 is linear in the variables
{x}, which must obey the constraint $x_{ij} \le x_{ik} + x_{kj}$, because,
if i and j are in the same cluster, and so are i and
k, then j and k must be in that cluster too. Maximizing
the expression in Eq. 33 under the above constraint
is NP-hard, if the variables have to be integer as required.
However, if one relaxes this condition by using
real-valued {x}, the problem can be solved in polynomial
time (Karloff, 1991). On the other hand, the solution
does not correspond to an actual partition, as the
x variables are fractional. To get clusters out of the {x}
one needs a rounding step. The values of the x variables
are used as a sort of distances in a metric space (the triangle
inequality is satisfied by construction): clusters of
vertices "close" enough to each other (i. e. whose mutual
x variables are close to zero) are formed and removed
until each vertex is assigned to a cluster. The resulting
partition is further refined with the same post-processing 
technique used by Newman for the spectral optimization 
of modularity, i. e. by a sequence of steps similar to 
those of the algorithm by Kernighan and Lin (see Sec- 
tion VI. A. 4). Quadratic programming can be used to 
get bisections of graphs with high modularity, that can 
be iterated to get a whole hierarchy of partitions as in 
Newman's spectral optimization. One starts from one of 



the identities in Eq. 29

$$ Q = \frac{1}{4m} \sum_{ij} B_{ij} (1 + s_i s_j), \tag{34} $$

where $s_i = \pm 1$, depending on whether the vertex belongs
to the first or the second cluster. Since the optimization
of the expression in Eq. 34 is NP-complete, one
must relax again the constraint on the variables s being
integer. A possibility is to transform each s into an
n-dimensional vector $\mathbf{s}$ and each product into the scalar
product between vectors. The vectors are normalized so
that their tips lie on the unit-sphere of the n-dimensional
space. This vector problem is polynomially solvable, but
one needs a method to associate a bipartition to the set
of n vectors of the solution. Any (n − 1)-dimensional
hyperplane centered at the origin cuts the space in two
halves, separating the vectors in two subsets. One can
then choose multiple random hyperplanes and pick the
one which delivers the partition with highest modularity.
As in the linear program, a post-processing technique
à la Newman (see Section VI.A.4) is used to improve
the results of the procedure. The two methods
proposed by Agarwal and Kempe are strongly limited by 
their high computational complexity, due mostly to the 
large storage demands, making graphs with more than 
$10^4$ vertices intractable. On the other hand, the idea of
applying mathematical programming to graph clustering 
is promising. The code of the algorithms can be down- 
loaded from http://www-scf.usc.edu/~gaurava/. In 
a recent work (G. Xu et al., 2007), Xu et al. have optimized
modularity using mixed-integer mathematical programming,
with both integer and continuous variables,
obtaining very good approximations of the modularity 
optimum, at the price of high computational costs. Chen 
et al. have used integer linear programming to transform 
the initial graph into an optimal target graph consist- 
ing of disjoint cliques, which effectively yields a parti- 
tion (Chen et al., 2008). Berry et al. have formulated 
the problem of graph clustering as a facility location problem
(Hillier and Lieberman, 2004), consisting in the minimization
of a cost function based on a local variation of
modularity (Berry et al., 2007). 
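
The rounding step of the quadratic (vector) program described above can be sketched as follows. This is not the code of Agarwal and Kempe: the relaxation itself is not solved here, and the unit vectors passed to the function are a hand-made, two-dimensional placeholder standing in for its solution (the text uses n-dimensional vectors). Function names, the number of trials and the seed are assumptions for illustration.

```python
import numpy as np

def bipartition_modularity(B, s, two_m):
    """Eq. 34: Q = (1/4m) * sum_ij B_ij (1 + s_i s_j), with s_i = +1 or -1."""
    return (B.sum() + s @ B @ s) / (2.0 * two_m)

def round_by_random_hyperplanes(A, vectors, trials=50, seed=0):
    """Cut the unit vectors with random hyperplanes through the origin and
    keep the bipartition with the highest modularity."""
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m
    rng = np.random.default_rng(seed)
    best_q, best_s = -np.inf, None
    for _ in range(trials):
        normal = rng.standard_normal(vectors.shape[1])  # hyperplane normal
        s = np.where(vectors @ normal >= 0, 1.0, -1.0)  # side of each vertex
        q = bipartition_modularity(B, s, two_m)
        if q > best_q:
            best_q, best_s = q, s
    return best_q, best_s

# Two triangles joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

# Placeholder "solution" of the relaxation: two tight bundles of unit vectors.
angles = np.array([0.1, 0.0, -0.1, np.pi - 0.1, np.pi, np.pi + 0.1])
vectors = np.column_stack([np.cos(angles), np.sin(angles)])

q, s = round_by_random_hyperplanes(A, vectors)
```

Trying many hyperplanes and keeping the best one is cheap compared with solving the relaxation, which is where the storage demands mentioned in the text arise.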

Lehmann and Hansen (Lehmann and Hansen, 2007) 
optimized modularity via mean field annealing (Peterson 
and Anderson, 1987), a deterministic alternative to simulated
annealing (Kirkpatrick et al., 1983). The method
uses Gibbs probabilities to compute the conditional mean
value for the variable of a vertex, which indicates its
community membership. By making a mean field approximation
on the variables of the other vertices in the
Gibbs probabilities one derives a self-consistent set of
non-linear equations, that can be solved by iteration in
a time $O[(m + n)n]$. The method yields better modularity
maxima than the spectral optimization by Newman
(Section VI. A. 4), at least on artificial graphs with built- 
in community structure, similar to the benchmark graphs 
by Girvan and Newman (Section XV.A).

Genetic algorithms (Holland, 1992) have also been
used to optimize modularity. In a standard genetic algorithm
one has a set of candidate solutions to a problem,
which are numerically encoded as chromosomes, and an
which are numerically encoded as chromosomes, and an 
objective function to be optimized on the space of solu- 
tions. The objective function plays the role of biological 
fitness for the chromosomes. One usually starts from 
a random set of candidate solutions, which are progres- 
sively changed through manipulations inspired by bio- 
logical processes regarding real chromosomes, like point 
mutation (random variations of some parts of the chro- 
mosome) and crossing over (generating new chromosomes 
by merging parts of existing chromosomes). Then, the fitness
of the new pool of candidates is computed and the
chromosomes with the highest fitness have the greatest 
chances to survive in the next generation. After sev- 
eral iterations only solutions with large fitness survive. 
In a work by Tasgin et al. (Tasgin et al., 2007), parti- 
tions are the chromosomes and modularity is the fitness 
function. With a suitable choice of the algorithm param- 
eters, like the number of chromosomes and the rates of 
mutation and crossing over, Tasgin et al. could obtain 
results of comparable quality to greedy modularity optimization
on Zachary's karate club (Zachary, 1977), the
college football network (Girvan and Newman, 2002) and 
the benchmark by Girvan and Newman (Section XV. A). 
Genetic algorithms were also adopted by Liu et al. (Liu 
et al., 2007). Here the maximum modularity partition is 
obtained via successive bipartitions of the graph, where 
each bipartition is determined by applying a genetic algo- 
rithm to each subgraph (starting from the original graph 
itself), which is considered isolated from the rest of the 
graph. A bipartition is accepted only if it increases the 
total modularity of the graph. 
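
The generic scheme described above (chromosomes, fitness, mutation, crossing over, survival of the fittest) can be sketched in a few lines. This is a toy illustration and not the algorithm of Tasgin et al. or of Liu et al.: the encoding (one cluster label per vertex), the one-point crossover, the truncation selection and all parameter values are assumptions chosen for brevity.

```python
import random

def modularity(edges, labels):
    """Q = sum_c [ l_c/m - (d_c/2m)^2 ] for an undirected edge list."""
    m = len(edges)
    degree_sum, internal = {}, {}
    for u, v in edges:
        for w in (u, v):
            degree_sum[labels[w]] = degree_sum.get(labels[w], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

def genetic_modularity(edges, n, pop_size=40, generations=100,
                       mutation_rate=0.05, seed=0):
    rng = random.Random(seed)

    def fitness(chromosome):
        return modularity(edges, chromosome)

    # A chromosome is a list of n cluster labels, one per vertex.
    population = [[rng.randrange(n) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]                      # crossing over
            child = [rng.randrange(n) if rng.random() < mutation_rate else g
                     for g in child]                       # point mutation
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    return best, fitness(best)

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_labels, best_q = genetic_modularity(edges, 6)
```

Because the fitter half of the population is always kept, the best modularity found never decreases from one generation to the next.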

In Section III.C.2 we have seen that the modularity 
maximum is obtained for the partition that minimizes 
the difference between the cut size and the expected cut 
size of the partition (Eq. 17). In the complete weighted
graph $\mathcal{G}_w$ such that the weight $w_{ij}$ of an edge is
$1 - k_i k_j/2m$, if i and j are connected in $\mathcal{G}$, and $-k_i k_j/2m$ if
they are not, the difference $|\mathrm{Cut}_{\mathcal{P}}| - \mathrm{ExCut}_{\mathcal{P}}$ is simply
the cut size of partition $\mathcal{P}$. So, maximizing modularity
for $\mathcal{G}$ is equivalent to the problem of finding the partition
with minimal cut size of the weighted graph $\mathcal{G}_w$, i. e.
to a graph partitioning problem. The problem can then
be efficiently solved by using existing software for graph 
partitioning (Djidjev, 2007). 
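
The equivalence between modularity and the cut size of the weighted graph can be verified on a small example. The sketch below assumes an unweighted, undirected graph given as a dense adjacency matrix, and uses the fact that the weights $A_{ij} - k_i k_j/2m$ sum to zero over all ordered pairs (diagonal included), so that Q equals minus the weighted cut size divided by m. Function names are illustrative.

```python
import numpy as np

def modularity(A, labels):
    labels = np.asarray(labels)
    k = A.sum(axis=1)
    two_m = k.sum()
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

def weighted_cut_size(A, labels):
    """Cut size of the partition in the complete weighted graph with
    weights w_ij = A_ij - k_i k_j / 2m."""
    labels = np.asarray(labels)
    k = A.sum(axis=1)
    two_m = k.sum()
    w = A - np.outer(k, k) / two_m
    different = labels[:, None] != labels[None, :]
    return (w * different).sum() / 2.0  # each unordered pair counted once

# Two triangles joined by a single edge (m = 7).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
m = A.sum() / 2.0

labels = [0, 0, 0, 1, 1, 1]
Q = modularity(A, labels)            # 5/14
cut = weighted_cut_size(A, labels)   # 1 - 3.5 = -2.5, and Q = -cut / m
```

Minimizing the cut size of the weighted graph therefore maximizes Q, which is what allows standard graph partitioning software to be reused.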



B. Modifications of modularity 

In the most recent literature on graph clustering sev- 
eral modifications and extensions of modularity can be 
found. They are usually motivated by specific classes of 
clustering problems and/or graphs that one may want to 
analyze. 

Modularity can be easily extended to graphs with 
weighted edges (Newman, 2004). One needs to replace
the degrees $k_i$ and $k_j$ in Eq. 14 with the strengths $s_i$
and $s_j$ of vertices i and j. We recall that the strength
of a vertex is the sum of the weights of edges adjacent
to the vertex (Section A.1). For a proper normalization,
the number of edges m in Eq. 14 has to be replaced by
the sum W of the weights of all edges. The product
$s_i s_j/2W$ is now the expected weight of the edge ij in
the null model of modularity, which has to be compared
with the actual weight $W_{ij}$ of that edge in the original
graph. This can be understood if we consider the case in
which all weights are multiples of a unit weight, so they
can be rewritten as integers. The weight of the connection
between two nodes can then be replaced by as many
edges between the nodes as expressed by the number of
weight units. For the resulting multigraph we can use
the same procedure as in the case of unweighted graphs,
which leads to the formally identical expression

$$ Q_w = \frac{1}{2W} \sum_{ij} \left( W_{ij} - \frac{s_i s_j}{2W} \right) \delta(C_i, C_j), \tag{35} $$

which can be also written as a sum over the modules

$$ Q_w = \sum_{c=1}^{n_c} \left[ \frac{W_c}{W} - \left( \frac{S_c}{2W} \right)^2 \right], \tag{36} $$

where $W_c$ is the sum of the weights of the internal edges
of module c and $S_c$ is the sum of the strengths of the
vertices of c. If edge weights are not mutually commensurable,
one can always represent them as integers with
good approximation, provided a sufficiently small weight
unit is adopted, so the expressions for weighted modularity
of Eqs. 35, 36 are generally valid. In principle,
weights can be assigned to the edges of an undirected
graph, by using any measure of similarity/correlation between
the vertices (like, e. g., the measures introduced
in Section III.B.4). In this way, one could derive the
corresponding weighted modularity and use it to detect
communities, with a potentially better exploitation of the
structural information of the graph as compared to standard
modularity (Feng et al., 2007; Ghosh and Lerman,
2008).
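
The two ways of writing weighted modularity, as a sum over vertex pairs and as a sum over modules, can be compared directly. The sketch below assumes a symmetric weight matrix with zero diagonal; the function names and the toy weights are illustrative.

```python
import numpy as np

def weighted_modularity_pairs(W, labels):
    """Sum over vertex pairs: (1/2W) sum_ij (W_ij - s_i s_j / 2W) delta(C_i, C_j)."""
    labels = np.asarray(labels)
    s = W.sum(axis=1)            # vertex strengths
    total = s.sum() / 2.0        # sum of all edge weights
    same = labels[:, None] == labels[None, :]
    return ((W - np.outer(s, s) / (2 * total)) * same).sum() / (2 * total)

def weighted_modularity_modules(W, labels):
    """Sum over modules: sum_c [ W_c / W - (S_c / 2W)^2 ]."""
    labels = np.asarray(labels)
    s = W.sum(axis=1)
    total = s.sum() / 2.0
    q = 0.0
    for c in np.unique(labels):
        members = labels == c
        W_c = W[np.ix_(members, members)].sum() / 2.0  # internal weight
        S_c = s[members].sum()                         # total strength
        q += W_c / total - (S_c / (2 * total)) ** 2
    return q

# Two heavy edges joined by a light one.
W = np.zeros((4, 4))
for u, v, w in [(0, 1, 2.0), (2, 3, 3.0), (1, 2, 0.5)]:
    W[u, v] = W[v, u] = w

labels = [0, 0, 1, 1]
q_pairs = weighted_modularity_pairs(W, labels)
q_modules = weighted_modularity_modules(W, labels)  # same value, 47.5/121
```

Setting all weights to one recovers the unweighted modularity, with strengths reducing to degrees and the total weight to m.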

Modularity has also a straightforward extension to the 
case of directed graphs (Arenas et al., 2007; Leicht and 
Newman, 2008). If an edge is directed, the probabil- 
ity that it will be oriented in either of the two possible 
directions depends on the in- and out-degrees of the end- 
vertices. For instance, taken two vertices A and B, where 
A (B) has a high (low) indegree and low (high) outde- 
gree, in the null model of modularity an edge will be 
much more likely to point from B to A than from A to 
B. Therefore, the expression of modularity for directed 
graphs reads

$$ Q_d = \frac{1}{m} \sum_{ij} \left( A_{ij} - \frac{k_i^{out} k_j^{in}}{m} \right) \delta(C_i, C_j), \tag{37} $$

where the factor 2 in the denominator of the second summand
has been dropped because the sum of the indegrees
(outdegrees) equals m, whereas the sum of the degrees of
the vertices of an undirected graph equals 2m; the factor
2 in the denominator of the prefactor has been dropped
because the number of non-vanishing elements of the adjacency
matrix is m, not 2m as in the symmetric case
of an undirected graph. If a graph is both directed and
weighted, formulas 35 and 37 can be combined as

$$ Q_{gen} = \frac{1}{W} \sum_{ij} \left( W_{ij} - \frac{s_i^{out} s_j^{in}}{W} \right) \delta(C_i, C_j), \tag{38} $$

which is the most general (available) expression of modularity
(Arenas et al., 2007). Kim et al. (Kim et al.,
2009) have remarked that the directed modularity of
Eq. 37 may not properly account for the directedness
of the edges (Fig. 14), and proposed a different definition
based on diffusion on directed graphs, inspired
by Google's PageRank algorithm (Brin and Page, 1998).
Rosvall and Bergstrom raised similar objections (Rosvall
and Bergstrom, 2008).

FIG. 14 Problem of the directed modularity introduced by
Arenas et al. (Arenas et al., 2007). The two situations illustrated
are equivalent for modularity, as vertices A and A', as
well as B and B', have identical indegrees and outdegrees.
In this way, the optimization of directed modularity is not
able to distinguish a situation in which there is directed flow
(top) or not (bottom). Reprinted figure with permission from
Ref. (Kim et al., 2009).

If vertices may belong to more clusters, it is not obvious
how to find a proper generalization of modularity. In fact,
there is no unique recipe. Shen et al. (Shen et al., 2009),
for instance, suggested the simple definition

$$ Q_{ov} = \frac{1}{2m} \sum_{ij} \frac{1}{O_i O_j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j). \tag{39} $$

Here $O_i$ is the number of communities including vertex i.
The contribution of each edge to modularity is then the
smaller, the larger the number of communities including
its endvertices. Nicosia et al. (Nicosia et al., 2009) have
made some more general considerations on the problem of
extending modularity to the case of overlapping communities.
They considered the case of directed unweighted
networks, starting from the following general expression

$$ Q_{ov} = \frac{1}{m} \sum_{c=1}^{n_c} \sum_{i,j} \left[ r_{ijc} A_{ij} - s_{ijc} \frac{k_i^{out} k_j^{in}}{m} \right], \tag{40} $$

where $k_i^{out}$ and $k_j^{in}$ are the outdegree and indegree of vertices
i and j, the index c labels the communities and $r_{ijc}$,
$s_{ijc}$ express the contributions to the sum corresponding
to the edge ij in the network and in the null model, due
to the multiple memberships of i and j. If there is no
overlap between the communities, $r_{ijc} = s_{ijc} = \delta_{c_i c_j c}$,
where $c_i$ and $c_j$ correspond to the communities of i and
j. In this case, the edge ij contributes to the sum only
if $c_i = c_j$, as in the original definition of modularity. For
overlapping communities, the coefficients $r_{ijc}$, $s_{ijc}$ must
depend on the membership coefficients $\alpha_{i,c}$, $\alpha_{j,c}$ of vertices
i and j. One can assume that $r_{ijc} = \mathcal{F}(\alpha_{i,c}, \alpha_{j,c})$,
where $\mathcal{F}$ is some function. The term $s_{ijc}$ is related to
the null model of modularity, and it must be handled
with care. In modularity's original null model edges are
formed by joining two random stubs, so one needs to
define the membership of a random stub in the various
communities. If we assume that there is no correlation
a priori between the membership coefficients of any two
vertices, we can assign to a stub originating from a vertex
i in community c the average membership corresponding
to all edges which can be formed with i. On a directed
graph we have to distinguish between outgoing and incoming
stubs, so one has

$$ \beta^{out}_{i \to, c} = \frac{\sum_j \mathcal{F}(\alpha_{i,c}, \alpha_{j,c})}{n}, \tag{41} $$

$$ \beta^{in}_{i \leftarrow, c} = \frac{\sum_j \mathcal{F}(\alpha_{j,c}, \alpha_{i,c})}{n}, \tag{42} $$

and one can write the following general expression for
modularity

$$ Q_{ov} = \frac{1}{m} \sum_{c=1}^{n_c} \sum_{i,j} \left[ \mathcal{F}(\alpha_{i,c}, \alpha_{j,c}) A_{ij} - \frac{\beta^{out}_{i \to, c} k_i^{out} \, \beta^{in}_{j \leftarrow, c} k_j^{in}}{m} \right]. \tag{43} $$

The question now concerns the choice of the function
$\mathcal{F}(\alpha_{i,c}, \alpha_{j,c})$. If the formula of Eq. 43 is to be an extension
of modularity to the case of overlapping communities,
it has to satisfy some general properties of classical
modularity. For instance, the modularity value of a
cover consisting of the whole network as a single cluster
should be zero. It turns out that a large class of functions
yield an expression for modularity that fulfills this
requirement. Otherwise, the choice of $\mathcal{F}$ is rather arbitrary
and good choices can be only tested a posteriori,
based on the results of the optimization. Membership coefficients
are also present in an extension of modularity to
overlapping communities proposed by Shen et al. (Shen
et al., 2009). Here the membership coefficient of vertex
v in community c is a sum over the edges of v belonging
to c, where each edge has a weight proportional to the
number of maximal cliques of c containing the edge.
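
As an illustration of the directed modularity of Eq. 37, with prefactor 1/m and null-model term $k_i^{out} k_j^{in}/m$, the following sketch evaluates it on a small directed graph. It assumes non-overlapping clusters and the convention that A[i, j] = 1 for an edge pointing from i to j; the function name and the toy graph are illustrative.

```python
import numpy as np

def directed_modularity(A, labels):
    """Q_d = (1/m) sum_ij (A_ij - k_i^out k_j^in / m) delta(C_i, C_j)."""
    labels = np.asarray(labels)
    m = A.sum()               # number of directed edges
    k_out = A.sum(axis=1)     # row sums: edges leaving each vertex
    k_in = A.sum(axis=0)      # column sums: edges entering each vertex
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k_out, k_in) / m) * same).sum() / m

# Two directed 3-cycles joined by the edge 2 -> 3 (m = 7).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[i, j] = 1.0

Q_d = directed_modularity(A, [0, 0, 0, 1, 1, 1])  # 18/49
```

Only the indegrees and outdegrees enter the null-model term, which is why two configurations with identical degree sequences, like those of Fig. 14, cannot be told apart by this quantity.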

Gaertler et al. have introduced quality measures based 
on modularity's principle of the comparison between a 
variable relative to the original graph and the correspond- 
ing variable of a null model (Gaertler et al., 2007). They 
remark that modularity is just the difference between 
the coverage of a partition and the expected coverage 
of the partition in the null model. We remind that the 
coverage of a partition is the ratio between the number 
of edges within clusters and the total number of edges 
(Section III.C.2). Based on this observation, Gaertler et 
al. suggest that the comparison between the two terms 
can be done with other binary operations as well. For 
instance, one could consider the ratio

$$ S_{cov}^{\div} = \frac{\sum_{c=1}^{n_c} l_c/m}{\sum_{c=1}^{n_c} (d_c/2m)^2}, \tag{44} $$

where the notation is the same as in Eq. 15. This can
be done as well for any variable other than coverage.
By using performance, for instance (Section III.C.2), one
obtains two new quality functions $S_{perf}^{-}$ and $S_{perf}^{\div}$,
corresponding to taking the difference or the ratio between
performance and its null model expectation value, respectively.
Gaertler et al. compared the results obtained with
the four functions $S_{cov}^{-} = Q$, $S_{cov}^{\div}$, $S_{perf}^{-}$ and $S_{perf}^{\div}$, on
a class of benchmark graphs with built-in cluster structure
(Section XV.A) and social networks. They found
that the "absolute" variants $S_{cov}^{-}$ and $S_{perf}^{-}$ are better
than the "relative" variants $S_{cov}^{\div}$ and $S_{perf}^{\div}$ on the artificial
benchmarks, whereas $S_{perf}^{\div}$ is better on social networks.
Furthermore $S_{perf}^{-}$ is better than the standard
modularity $S_{cov}^{-}$.
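
The difference and the ratio between coverage and its null-model expectation can be computed side by side. The sketch below assumes an undirected, unweighted edge list and uses the notation of Eq. 15 ($l_c$ internal edges, $d_c$ total degree of cluster c); the function name is illustrative.

```python
def coverage_scores(edges, labels):
    """Return (difference, ratio) between coverage and expected coverage.
    The difference is standard modularity."""
    m = len(edges)
    internal = sum(1 for u, v in edges if labels[u] == labels[v])
    degree_sum = {}
    for u, v in edges:
        for w in (u, v):
            degree_sum[labels[w]] = degree_sum.get(labels[w], 0) + 1
    coverage = internal / m
    expected = sum((d / (2 * m)) ** 2 for d in degree_sum.values())
    return coverage - expected, coverage / expected

# Two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
difference, ratio = coverage_scores(edges, [0, 0, 0, 1, 1, 1])
# difference = 6/7 - 1/2 = 5/14, ratio = (6/7) / (1/2) = 12/7
```

For the trivial partition with a single cluster both coverage and its expectation equal one, so the difference vanishes while the ratio equals one.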

The comparison was done by computing the values of significance
indices like coverage and performance on the final partitions.

Modifications of modularity's null model have been introduced
by Massen and Doye (Massen and Doye, 2005)
and Muff et al. (Muff et al., 2005). Massen and Doye's
null model is still a graph with the same expected degree
sequence as the original, and with edges rewired at random
among the vertices, but one imposes the additional
constraint that there can be neither multiple edges between
a pair of vertices nor edges joining a vertex with
itself (loops or self-edges). This null model is more realistic,
as multiple edges and loops are usually absent in real
graphs. The maximization of the corresponding modified
modularity yields partitions with smaller average cluster
size than standard modularity. The latter tends to disfavor
small communities, because the actual densities of
edges inside small communities hardly exceed the null
model densities, which are appreciably enhanced by the
contributions from multiple connections and loops. Muff
et al. proposed a local version of modularity, in which the
expected number of edges within a module is not calculated
with respect to the full graph, but considering just
a portion of it, namely the subgraph including the module
and its neighbouring modules. Their motivation is
the fact that modularity's null model implicitly assumes
that each vertex could be attached to any other, whereas
in real cases a cluster is usually connected to few other
clusters. On a directed graph, their localized modularity
LQ reads

$$ LQ = \sum_{c=1}^{n_c} \left[ \frac{l_c}{L_{cn}} - \frac{d_c^{in} d_c^{out}}{L_{cn}^2} \right]. \tag{45} $$

In Eq. 45 $l_c$ is the number of edges inside cluster c, $d_c^{in}$
($d_c^{out}$) the total internal (external) degree of cluster c
and $L_{cn}$ the total number of edges in the subgraph comprising
cluster c and its neighbor clusters. The localized
modularity is not bounded by 1, but can take any
value. Its maximization delivers more accurate partitions
than standard modularity optimization on a model network
describing the social interactions between children
in a school (school network) and on the metabolic and
protein-protein interaction networks of E. coli.

Reichardt and Bornholdt have shown that it is possible 
to reformulate the problem of community detection as 
the problem of finding the ground state of a spin glass 
model (Reichardt and Bornholdt, 2006a). Each vertex i 
is labeled by a Potts spin variable $\sigma_i$, which indicates the
cluster including the vertex. The basic principle of the 
model is that edges should connect vertices of the same 
class (i. e. same spin state), whereas vertices of different 
classes (i. e. different spin states) should be disconnected 
(ideally). So, one has to energetically favor edges between 
vertices in the same class, as well as non-edges between 
vertices in different classes, and penalize edges between 
vertices of different classes, along with non-edges between 
vertices in the same class. The resulting Hamiltonian of 
the spin model is

$$ \mathcal{H}(\{\sigma\}) = -\sum_{i<j} J_{ij} \, \delta(\sigma_i, \sigma_j) = -J \sum_{i<j} (A_{ij} - \gamma p_{ij}) \, \delta(\sigma_i, \sigma_j), \tag{46} $$

where J is a constant expressing the coupling strength,
$A_{ij}$ are the elements of the adjacency matrix of the
graph, $\gamma > 0$ a parameter expressing the relative contribution
to the energy from existing and missing edges,
and $p_{ij}$ is the expected number of links connecting i and
j for a null model graph with the same total number of
edges m of the graph considered. The system is a spin
glass (Mezard et al., 1987), as the couplings $J_{ij}$ between
spins are both ferromagnetic (on the edges of the graph,
provided $\gamma p_{ij} < 1$) and antiferromagnetic (between disconnected
vertices, as $A_{ij} = 0$ and $J_{ij} = -J \gamma p_{ij} < 0$).
The multiplicative constant J is irrelevant for practical
purposes, so in the following we set J = 1. The range of
the spin-spin interaction is infinite, as there is a non-zero 
coupling between any pair of spins. Eq. 46 bears a strong 
resemblance with the expression of modularity of Eq. 14. 
In fact, if $\gamma = 1$ and $p_{ij} = k_i k_j/2m$ we recover exactly
modularity, up to a factor $-1/m$. In this case, finding the
spin configuration for which the Hamiltonian is minimal
is equivalent to maximizing modularity. Eq. 46 is much
more general than modularity, though, as both the null
model and the parameter $\gamma$ can be arbitrarily chosen. In
particular, the value of $\gamma$ determines the importance of
the null model term pij in the quality function. Eq. 46 
can be rewritten as 

$$ \mathcal{H}(\{\sigma\}) = -\sum_{s} [l_s - \gamma (l_s)_{p_{ij}}] = -\sum_{s} c_{ss} = \sum_{s<r} [l_{rs} - \gamma (l_{rs})_{p_{ij}}] = \sum_{s<r} a_{rs}. \tag{47} $$

Here, the sums run over the clusters: $l_s$ and $l_{rs}$ indicate
the number of edges within cluster s and between
clusters r and s, respectively; $(l_s)_{p_{ij}}$ and $(l_{rs})_{p_{ij}}$ are the
corresponding null model expectation values. Eq. 47 defines
the coefficients $c_{ss}$ of cohesion and $a_{rs}$ of adhesion.
If a subset of a cluster s has a larger coefficient of adhesion
with another cluster r than with its complement in
s, the energy can be reduced by merging the subset with
cluster r. In the particular case in which the coefficient
of adhesion of a subset $\mathcal{G}'$ of a cluster s with its complement
in the cluster exactly matches the coefficient of
adhesion of $\mathcal{G}'$ with another cluster r, the partitions in
which $\mathcal{G}'$ stays within s or is merged with r have the same
energy. In this case one can say that clusters r and s are
overlapping. In general, the partition with minimum energy
has the following properties: 1) every subset of each
cluster has a coefficient of adhesion with its complement
in the cluster not smaller than with any other cluster;
2) every cluster has non-negative coefficient of cohesion;
3) the coefficient of adhesion between any two clusters is
non-positive.

By tuning the parameter $\gamma$ one can vary the number
of clusters in the partition with minimum energy, going
from a single cluster comprising all vertices ($\gamma = 0$), to n
clusters with a single vertex ($\gamma \to \infty$). So, $\gamma$ is a resolution
parameter that allows one to explore the cluster structure
of a graph at different scales (see Section VI.C). The
authors used single spin heatbath simulated annealing al- 
gorithms to find the ground state of the Hamiltonian of 
Eq. 46. 
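
The role of $\gamma$ as a resolution parameter can be seen by evaluating the energy of Eq. 46 for a few candidate partitions. The sketch below assumes J = 1 and the configuration-model null model $p_{ij} = k_i k_j/2m$; it only compares three hand-picked partitions and does not perform the simulated annealing used by the authors. The function name is illustrative.

```python
import numpy as np

def potts_energy(A, labels, gamma=1.0):
    """H = - sum_{i<j} (A_ij - gamma * p_ij) * delta(sigma_i, sigma_j), J = 1."""
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    p = np.outer(degrees, degrees) / degrees.sum()  # assumed p_ij = k_i k_j / 2m
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones(A.shape, dtype=bool), 1)  # pairs with i < j
    return -((A - gamma * p) * same * upper).sum()

# Two triangles joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

candidates = {
    "one cluster": [0, 0, 0, 0, 0, 0],
    "two triangles": [0, 0, 0, 1, 1, 1],
    "singletons": [0, 1, 2, 3, 4, 5],
}
for gamma in (0.0, 1.0, 10.0):
    lowest = min(candidates, key=lambda name: potts_energy(A, candidates[name], gamma))
    print(gamma, lowest)
```

Among these three candidates the single cluster has the lowest energy at $\gamma = 0$, the two triangles at $\gamma = 1$ and the singletons at $\gamma = 10$, in line with the behaviour described above.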

Another generalization of modularity was recently suggested
by Arenas et al. (Arenas et al., 2008a). They remarked
that the fundamental unit to define modularity is
the edge, but that high edge densities inside clusters usually
imply the existence of long-range topological correlations
between vertices, which are revealed by the presence
of motifs (Milo et al., 2002), i. e. connected undirected
subgraphs, like cycles (Section A.1). For instance, a high
edge density inside a cluster usually means that there
are also several triangles in the cluster, and comparatively
few between clusters, a criterion that has inspired
on its own popular graph clustering algorithms (Palla
et al., 2005; Radicchi et al., 2004). Modularity can then
be simply generalized by comparing the density of motifs
inside clusters with the expected density in modularity's
null model (motif modularity). As a particular case, the
triangle modularity of a partition $\mathcal{C}$ reads

$$ Q_\triangle(\mathcal{C}) = \frac{\sum_{ijk} A_{ij}(\mathcal{C}) A_{jk}(\mathcal{C}) A_{ki}(\mathcal{C})}{\sum_{ijk} A_{ij} A_{jk} A_{ki}} - \frac{\sum_{ijk} n_{ij}(\mathcal{C}) n_{jk}(\mathcal{C}) n_{ki}(\mathcal{C})}{\sum_{ijk} n_{ij} n_{jk} n_{ki}}, \tag{48} $$

where $A_{ij}(\mathcal{C}) = A_{ij} \delta(C_i, C_j)$ ($C_i$ is the label of the cluster
i belongs to), $n_{ij} = k_i k_j$ ($k_i$ is the degree of vertex
i) and $n_{ij}(\mathcal{C}) = n_{ij} \delta(C_i, C_j)$. If one chooses as motifs
paths with even length, and removes the constraint that
all vertices of the motif/path should stay inside the same
cluster, maximizing motif modularity could reveal the ex- 
istence of multipartite structure. For example, if a graph 
is bipartite, one expects to see many 2-paths starting 
from one vertex class and returning to it from the other 
class. Motif modularity can be trivially extended to the 
case of weighted graphs. 
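
The triangle modularity can be evaluated with matrix products, since a sum of the form $\sum_{ijk} M_{ij} M_{jk} M_{ki}$ is the trace of $M^3$. The sketch below assumes an undirected, unweighted graph that contains at least one triangle (otherwise the first denominator vanishes); the function names are illustrative.

```python
import numpy as np

def closed_triples(M):
    """sum_ijk M_ij M_jk M_ki, i.e. the trace of M^3."""
    return np.trace(M @ M @ M)

def triangle_modularity(A, labels):
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    degrees = A.sum(axis=1)
    N = np.outer(degrees, degrees)   # n_ij = k_i k_j
    observed = closed_triples(A * same) / closed_triples(A)
    expected = closed_triples(N * same) / closed_triples(N)
    return observed - expected

# Two triangles joined by a single edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

Q_tri = triangle_modularity(A, [0, 0, 0, 1, 1, 1])  # 1 - 1/4 = 0.75
```

In this example both triangles lie inside clusters, so the first term equals one, while the null-model term equals 2 · 17³ / 34³ = 1/4.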

Several graphs representing real systems are built out 
of correlation data between elements. Correlation ma- 
trices are very common in the study of complex sys- 
tems: well-known examples are the correlations of price 
returns, which are intensively studied by economists and 
econophysicists (Mantegna and Stanley, 2000). Correlations
may be positive as well as negative, so the corresponding
weighted edges indicate both attraction and
repulsion between pairs of vertices. Usually the correlation
values are filtered or otherwise transformed so as to
eliminate the weakest correlations and anticorrelations
and to maintain strictly positive weights for the edges,
yielding graphs that can be treated with standard techniques.
However, ignoring negative correlations means
giving up useful information on the relationships between
vertices. Finding clusters in a graph with both positive
and negative weights is called the correlation clustering
problem (Bansal et al., 2004). According to intuition, one
expects that vertices of the same cluster are linked by 
positive edges, whereas vertices of different clusters are 
linked by negative edges. The best cluster structure is 
the partition that maximizes the sum of the strengths 
(in absolute value) of positive edges within clusters and 
negative edges between clusters, or, equivalently, the par- 
tition that minimizes the sum of the strengths (in abso- 
lute value) of positive edges between clusters and neg- 
ative edges within clusters. This can be formulated by 
means of modularity, if one accounts for the contribu- 
tion of the negative edges. A natural way to proceed is 
to create two copies of the graph at study: in one copy 
only the weights of the positive edges are kept, in the 
other only the weights of the negative edges (in absolute
value). By applying Eq. 35 to the same partition of both
graphs, one derives the contributions $Q^+$ and $Q^-$ to
the modularity of that partition for the original graph.
Gomez et al. define the global modularity as a linear
combination of $Q^+$ and $Q^-$ that accounts for the
relative total strengths of positive and negative edge
weights (Gomez et al., 2009). Kaplan and Forrest (Kaplan
and Forrest, 2008) have proposed a similar expression,
with two important differences. First, they have used the
total strength of the graph, i. e. the sum of the absolute
values of all weights, to normalize $Q^+$ and $Q^-$; Gomez
et al. instead have used the positive and the negative
strengths, for $Q^+$ and $Q^-$, respectively, which seems
to be the more natural choice looking at Eq. 35. Second,
Kaplan and Forrest have given equal weight to the
contributions of $Q^+$ and $Q^-$ to their final expression
of modularity, which is just the difference $Q^+ - Q^-$. In
another work, Traag and Bruggeman (Traag and Bruggeman,
2009) have introduced negative links in the general spin
glass formulation of modularity of Reichardt and Bornholdt
(Reichardt and Bornholdt, 2006a). Here the relative
importance of the contribution of positive and negative
edge weights is a free parameter, the tuning of which
allows one to detect communities of various sizes and
densities of positive/negative edges.
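
The two-copy construction can be illustrated with a minimal Python sketch. It rests on two assumptions that are not spelled out in the text: that Eq. 35 is the usual weighted modularity, and that the linear combination of Gomez et al. weighs $Q^+$ and $Q^-$ by the positive and negative total strengths, i. e. $Q = (W^+ Q^+ - W^- Q^-)/(W^+ + W^-)$. The function names are hypothetical.

```python
from collections import defaultdict


def weighted_modularity(edges, community):
    """Weighted modularity of a partition (assumed form of Eq. 35).

    edges: list of (u, v, w) with w > 0, each undirected edge listed once.
    community: dict mapping every vertex to a cluster label.
    """
    total = sum(w for _, _, w in edges)           # W, total edge weight
    if total == 0:
        return 0.0
    strength = defaultdict(float)                 # s_i, vertex strengths
    internal = defaultdict(float)                 # weight inside each cluster
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    cluster_strength = defaultdict(float)
    for node, s in strength.items():
        cluster_strength[community[node]] += s
    return sum(internal[c] / total - (cluster_strength[c] / (2 * total)) ** 2
               for c in cluster_strength)


def signed_modularity(edges, community):
    """Combine Q+ and Q- of the two copies (assumed Gomez-style weighting)."""
    pos = [(u, v, w) for u, v, w in edges if w > 0]
    neg = [(u, v, -w) for u, v, w in edges if w < 0]   # absolute values
    w_pos = sum(w for _, _, w in pos)
    w_neg = sum(w for _, _, w in neg)
    if w_pos + w_neg == 0:
        return 0.0
    q_pos = weighted_modularity(pos, community)
    q_neg = weighted_modularity(neg, community)
    return (w_pos * q_pos - w_neg * q_neg) / (w_pos + w_neg)
```

For four vertices with positive edges a-b and c-d and negative edges a-c and b-d, the partition {a,b},{c,d} scores higher than {a,c},{b,d}, in line with the intuition that positive edges should fall within clusters and negative edges between them.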

Some authors have pointed out that the original ex- 
pression of modularity is not ideal to detect communi- 
ties in bipartite graphs, which describe several real sys- 
tems, like food webs (Williams and Martinez, 2000), sci- 
entific (Newman, 2001) and artistic (Gleiser and Danon, 
2003) collaboration networks, etc. Expressions of
modularity for bipartite graphs were suggested by Guimera
et al. (Guimera et al., 2007) and Barber (Barber, 2007;
Barber et al., 2008). Guimera et al. call the two classes
of vertices actors and teams, and indicate with $t_i$ the
degree of actor $i$ and $m_a$ the degree of team $a$. The
null model graphs are random graphs with the same expected
degrees for the vertices, as usual. The bipartite
modularity $\mathcal{M}_B(\mathcal{P})$ for a partition
$\mathcal{P}$ (of the actors) has the following expression

$$\mathcal{M}_B(\mathcal{P}) = \sum_{c=1}^{n_c} \left[ \frac{\sum_{i \neq j \in c} c_{ij}}{\sum_a m_a (m_a - 1)} - \frac{\sum_{i \neq j \in c} t_i t_j}{\left(\sum_a m_a\right)^2} \right]. \qquad (49)$$

Here, $c_{ij}$ is the number of teams in which actors $i$
and $j$ are together and the sum $\sum_a m_a (m_a - 1)$
gives the number of ordered pairs of actors in the same
team. The second ratio of each summand is the null model
term, indicating the expected (normalized) number of teams
for pairs of actors in cluster $c$. The bipartite
modularity can also be applied to (unipartite) directed
graphs: each vertex can be duplicated and assigned to both
classes, based on its twofold role of source and target
for the edges.
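
As an illustration, the bipartite modularity of Guimera et al. can be evaluated directly from the list of teams. The sketch below is only meant to make the two ratios concrete; it assumes that the sums over $i \neq j$ run over ordered pairs of actors in the same module, and the function and argument names are hypothetical.

```python
from itertools import permutations


def bipartite_modularity(teams, module_of):
    """Bipartite modularity of a partition of the actors.

    teams: list of sets, each set holding the actors of one team.
    module_of: dict mapping every actor to its module label.
    """
    t = {}                                   # t_i: number of teams of actor i
    for team in teams:
        for actor in team:
            t[actor] = t.get(actor, 0) + 1
    sizes = [len(team) for team in teams]    # m_a: number of actors in team a
    ordered_pairs = sum(m * (m - 1) for m in sizes)
    total_degree = sum(sizes)
    if ordered_pairs == 0:
        raise ValueError("no team contains two or more actors")

    modules = {}
    for actor, label in module_of.items():
        modules.setdefault(label, []).append(actor)

    score = 0.0
    for members in modules.values():
        together = 0    # sum of c_ij over ordered pairs i != j in the module
        expected = 0    # sum of t_i * t_j over the same pairs
        for i, j in permutations(members, 2):
            together += sum(1 for team in teams if i in team and j in team)
            expected += t.get(i, 0) * t.get(j, 0)
        score += together / ordered_pairs - expected / total_degree ** 2
    return score
```

With four teams {a,b}, {a,b}, {c,d}, {c,d}, the partition that keeps a with b and c with d gives 0.75, while the partition {a,c},{b,d} gives -0.25.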

Another interesting alternative was introduced by Bar- 
ber (Barber, 2007; Barber et al., 2008) and is a simple 
extension of Eq. 14. Let us suppose that the two vertex 
classes (red and blue) are made out of $p$ and $q$
vertices, respectively. The degree of a red vertex $i$ is
indicated with $k_i$, that of a blue vertex $j$ with $d_j$.
The adjacency matrix $A$ of the graph is in block
off-diagonal form, as there are edges only between red and
blue vertices. Because of that, Barber assumes that the
null model matrix $P$, whose element $P_{ij}$ indicates as
usual the expected number of edges between vertices $i$
and $j$ in the null model, also has the block off-diagonal
form

$$P = \begin{pmatrix} O_{p \times p} & \tilde{P}_{p \times q} \\ \tilde{P}^T_{q \times p} & O_{q \times q} \end{pmatrix}, \qquad (50)$$

where the $O$ are square matrices with all zero elements
and $\tilde{P}_{ij} = k_i d_j / m$, as in the null model of
standard modularity (though other choices are possible).
The modularity maximum can be computed through the
modularity matrix $B = A - P$, as we have seen in
Section VI.A.4.
However, spectral optimization of modularity gives
excellent results for bipartitions, while its performance
worsens when the number of clusters is unknown, as is
usually the case. Barber has proposed a different
optimization technique, called Bipartite Recursively Induced
Modules (BRIM), based on the bipartite nature of the 
graph. The algorithm is based on the special expression 
of modularity for the bipartite case, for which once the 
partition of the red or the blue vertices is known, it is 
easy to get the partition of the other vertex class that 
yields the maximum modularity. Therefore, one starts 
from an arbitrary partition in c clusters of, say, the blue 
vertices, and recovers the partition of the red vertices, 
which is in turn used as input to get a better partition of 
the blue vertices, and so on until modularity converges. 
BRIM does not predict the number of clusters c of the 
graph, but one can obtain good estimates for it by ex- 
ploring different values with a simple bisection approach. 
Typically, for a given c the algorithm needs a few steps 
to converge, each step having a complexity $O(m)$. An
expression of the number of convergence steps in terms
of $n$ and/or $m$ still needs to be derived.
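
The alternating scheme can be sketched as follows. This is a simplified illustration and not Barber's reference implementation: it assumes that the bipartite modularity reads $Q = (1/m) \sum_{ij} (\tilde{A}_{ij} - k_i d_j/m)\, \delta(g_i, h_j)$, with $g_i$ and $h_j$ the cluster labels of red and blue vertices, and it uses a dense matrix, so each sweep costs $O(pq)$ rather than $O(m)$. All names are hypothetical.

```python
def brim(biadj, blue_labels, max_sweeps=100):
    """Alternate red and blue reassignments until modularity stops growing.

    biadj[i][j] is 1 if red vertex i is linked to blue vertex j, else 0.
    blue_labels: initial cluster label in 0..c-1 for every blue vertex.
    Returns (red_labels, blue_labels, modularity).
    """
    p, q = len(biadj), len(biadj[0])
    k = [sum(row) for row in biadj]                              # red degrees
    d = [sum(biadj[i][j] for i in range(p)) for j in range(q)]   # blue degrees
    m = sum(k)
    c = max(blue_labels) + 1
    # Off-diagonal block of the modularity matrix: A_ij - k_i d_j / m.
    B = [[biadj[i][j] - k[i] * d[j] / m for j in range(q)] for i in range(p)]

    def modularity(red, blue):
        return sum(B[i][j] for i in range(p) for j in range(q)
                   if red[i] == blue[j]) / m

    blue = list(blue_labels)
    red = [0] * p
    best = float("-inf")
    for _ in range(max_sweeps):
        # Given the blue partition, each red vertex picks its best cluster.
        for i in range(p):
            gain = [0.0] * c
            for j in range(q):
                gain[blue[j]] += B[i][j]
            red[i] = max(range(c), key=gain.__getitem__)
        # Given the red partition, each blue vertex picks its best cluster.
        for j in range(q):
            gain = [0.0] * c
            for i in range(p):
                gain[red[i]] += B[i][j]
            blue[j] = max(range(c), key=gain.__getitem__)
        current = modularity(red, blue)
        if current <= best + 1e-12:      # no further improvement
            break
        best = current
    return red, blue, modularity(red, blue)
```

The number of clusters $c$ is fixed by the initial labels, so exploring different values of $c$, as described above, would require calling the function repeatedly.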



C. Limits of modularity 

In this Section we shall discuss some features of mod- 
ularity, which are crucial to identify the domain of its 
applicability and ultimately to assess the issue of the re- 
liability of the measure for the problem of graph cluster- 
ing. 

An important question concerns the value of the max- 
imum modularity Qmax for a graph. We know that it 
must be non-negative, as there is always at least a par- 
tition with zero modularity, consisting in a single clus- 
ter with all vertices (Section III.C.2). However, a large 
value for the modularity maximum does not necessarily 
mean that a graph has community structure. Random 
graphs are supposed to have no community structure, 
as the linking probability between vertices is either con- 
stant or a function of the vertex degrees, so there is no 
bias a priori towards special groups of vertices. Still, ran- 
dom graphs may have partitions with large modularity 
values (Guimera et al., 2004; Reichardt and Bornholdt, 
2006a). This is due to fluctuations in the distribution of
edges in the graph, which in many graph realizations is
not homogeneous even if the linking probability is
constant, like in Erdos-Renyi graphs. The fluctuations
determine concentrations of links in some subsets of the
graph, which then appear like communities. According
to the definition of modularity, a graph has community 
structure with respect to a random graph with equal size 
and expected degree sequence. Therefore, the modular- 
ity maximum of a graph reveals a significant community 
structure only if it is appreciably larger than the modu- 
larity maximum of random graphs of the same size and 
expected degree sequence. The significance of the mod- 
ularity maximum Qmax for a graph can be estimated by 
calculating the maximum modularity for many realiza- 
tions of the null model, obtained from the original graph 
by randomly rewiring its edges. One then computes the
average $\langle Q \rangle_{NM}$ and the standard deviation
$\sigma_Q^{NM}$ of the results. The statistical
significance of $Q_{max}$ is indicated by the distance of
$Q_{max}$ from the null model average
$\langle Q \rangle_{NM}$ in units of the standard deviation
$\sigma_Q^{NM}$, i. e. by the z-score

$$z = \frac{Q_{max} - \langle Q \rangle_{NM}}{\sigma_Q^{NM}}. \qquad (51)$$

If $z \gg 1$, $Q_{max}$ indicates strong community structure.
Cutoff values of 2-3 for the z-scores are customary.
This approach has problems, though. It can generate 
both false positives and false negatives: a few graphs that 
most people would consider without a significant commu- 
nity structure have a large z-score; on the other hand, 
some graphs that are agreed to display cluster structure 
have very low values for the z-score. Besides, the dis- 
tribution of the maximum modularity values of the null 
model, though peaked, is not Gaussian. Therefore, one 
cannot attribute to the values of the z-score the signifi- 
cance corresponding to a Gaussian distribution, and one 
would need instead to compute the statistical significance 
for the right distribution. 
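
Once the maximum modularity has been estimated for the original graph and for a set of rewired realizations, the z-score is a one-line computation. The sketch below assumes that those values are already available (the optimization and the rewiring are not shown) and uses the sample standard deviation, which is a choice of ours; the function name is hypothetical.

```python
from statistics import mean, stdev


def modularity_z_score(q_max, null_maxima):
    """z-score of q_max against maximum modularities of null model graphs."""
    if len(null_maxima) < 2:
        raise ValueError("need at least two null model realizations")
    sigma = stdev(null_maxima)          # sample standard deviation
    if sigma == 0:
        raise ValueError("null model values do not fluctuate")
    return (q_max - mean(null_maxima)) / sigma
```

As stressed above, the resulting number should not be read as a Gaussian significance level, since the null distribution of the maximum modularity is not Gaussian.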

Reichardt and Bornholdt have studied the issue of the 
modularity values for random graphs in some depth (Re- 
ichardt and Bornholdt, 2006b, 2007), using their general 
spin glass formulation of the clustering problem
(Section VI.B). They considered the general case of a
random graph with arbitrary degree distribution $P(k)$ and
without degree-degree correlations. They set $\gamma = 1$,
so that the energy of the ground state coincides with
modularity (up to a constant factor). For modularity's null
model graphs, the modularity maximum corresponds to 
an equipartition of the graph, i. e. the magnetization of 
the ground state of the spin glass is zero, a result con- 
firmed by numerical simulations (Reichardt and Born- 
holdt, 2006b, 2007). This is because the distribution of 
the couplings has zero mean, and the mean is only cou- 
pled to magnetization (Fu and Anderson, 1986). For a 
partition of any graph with $n$ vertices and $m$ edges in
$q$ clusters with equal numbers of vertices, there is a
simple linear relation between the cut size $C_q$ of the
partition and its modularity $Q_q$:
$C_q = m[(q - 1)/q - Q_q]$. We recall that the cut size
$C_q$ is the total number of inter-cluster edges of the
partition (Section IV.A). In this way,
the partition with maximum modularity is also the one 
with minimum cut size, and community detection be- 
comes equivalent to graph partitioning. Reichardt and 
Bornholdt derived analytically the ground state energy 
for Ising spins ($q = 2$), which corresponds to the
following expression of the expected maximum modularity
$Q_2^{max}$ for a bipartition (Reichardt and Bornholdt, 2007)

$$Q_2^{max} = U_0 J \frac{\langle k^{1/2} \rangle}{\langle k \rangle}. \qquad (52)$$

Here $\langle k^\alpha \rangle = \int P(k) k^\alpha dk$ and
$U_0$ is the ground state energy of the
Sherrington-Kirkpatrick model (Sherrington and Kirkpatrick,
1975). The most interesting feature of Eq. 52 is the simple
scaling with $\langle k^{1/2} \rangle / \langle k \rangle$.
Numerical calculations show that this scaling holds for
both Erdos-Renyi and scale-free graphs (Section A.3).
Interestingly, the result is valid for partitions in $q$
clusters, where $q$ is left free, not only for $q = 2$. The
number of clusters of the partition with maximum modularity
decreases if the average degree $\langle k \rangle$
increases, and tends to 5 for large values of
$\langle k \rangle$, regardless of the degree distribution
and the size of the graph. From Eq. 52 we also see that the
expected maximum modularity for a random graph increases
when $\langle k \rangle$ decreases, i. e. if the graph gets
sparser. So it is particularly hard to detect communities
in sparse graphs by
using modularity optimization. As we shall see in Sec- 
tion XIV, the sparsity of a graph is generally a serious 
obstacle for graph clustering methods, no matter if one 
uses modularity or not. 
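
The dependence on sparsity can be made concrete by evaluating the scaling factor $\langle k^{1/2} \rangle / \langle k \rangle$ on a degree sequence. The sketch below computes only this factor, not the prefactor $U_0 J$ of Eq. 52; the function name is hypothetical.

```python
from math import sqrt


def modularity_scaling(degrees):
    """Return <k^(1/2)> / <k> for a sequence of vertex degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    if mean_k == 0:
        raise ValueError("the graph has no edges")
    mean_sqrt_k = sum(sqrt(k) for k in degrees) / n
    return mean_sqrt_k / mean_k
```

For a regular graph of degree 4 the factor is 0.5, while for degree 16 it drops to 0.25: the sparser graph is expected to have the larger spurious modularity maximum.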

A more fundamental issue, raised by Fortunato and 
Barthelemy (Fortunato and Barthelemy, 2007), concerns 
the capability of modularity to detect "good" partitions. 
If a graph has a clear cluster structure, one expects that 
the maximum modularity of the graph reveals it. The 
null model of modularity assumes that any vertex i "sees" 
any other vertex j, and the expected number of edges 
between them is $P_{ij} = k_i k_j / 2m$. Similarly, the
expected number of edges between two clusters $A$ and $B$
with total degrees $K_A$ and $K_B$, respectively, is
$P_{AB} = K_A K_B / 2m$. The variation of modularity
determined by the merger of $A$ and $B$ with respect to the
partition in which they are separate clusters is
$\Delta Q_{AB} = l_{AB}/m - K_A K_B / 2m^2$, with $l_{AB}$
the number of edges connecting $A$ to $B$. If $l_{AB} = 1$,
i. e. there is a single edge joining $A$ to $B$, we expect
that the two subgraphs will often be kept separated.
Instead, if $K_A K_B / 2m < 1$, $\Delta Q_{AB} > 0$. Let us
suppose for simplicity that $K_A \sim K_B = K$, i. e. that
the two subgraphs are of about the same size, measured in
terms of edges. We conclude that, if $K \lesssim \sqrt{2m}$
and the two subgraphs $A$ and $B$ are connected, modularity
is greater if they are
in the same cluster (Fortunato and Barthelemy, 2007). 
The reason is intuitive: if there are more edges than 
expected between A and B, there is a strong topologi- 
cal correlation between the subgraphs. If the subgraphs 
are sufficiently small (in degree), the expected number 
of edges for the null model can be smaller than one, so

FIG. 15 Resolution limit of modularity optimization. The
natural community structure of the graph, represented by the 
individual cliques (circles), is not recognized by optimizing 
modularity, if the cliques are smaller than a scale depending 
on the size of the graph. In this case, the maximum modu- 
larity corresponds to a partition whose clusters include two 
or more cliques (like the groups indicated by the dashed con- 
tours). Reprinted figure with permission from Ref. (Fortunato 
and Barthelemy, 2007). ©2007 from the National Academy 
of Science of the USA. 



even the weakest possible connection (a single edge) suf- 
fices to keep the subgraphs together. Interestingly, this 
result holds independently of the structure of the sub- 
graphs. In particular it remains true if the subgraphs are 
cliques, which are the subgraphs with the largest possi- 
ble density of internal edges, and represent the strongest 
possible communities. In Fig. 15 a graph is made out of
$n_c$ identical cliques, with $l$ vertices each, connected
by single edges. It is intuitive to think that the clusters
of the best partition are the individual cliques: instead,
if $n_c$ is larger than about $l^2$, modularity would be
higher for partitions in which the clusters are groups of
cliques (like the clique pairs indicated by the dashed
lines in the figure).
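
The ring of cliques is simple enough that both modularity values can be written in closed form, which makes the effect easy to check numerically. The sketch below assumes the standard expression $Q = \sum_c [l_c/m - (K_c/2m)^2]$, with $l_c$ the internal edges and $K_c$ the total degree of cluster $c$, and a ring in which every clique has one edge to the next one; the function names are hypothetical.

```python
def merge_gain(l_ab, k_a, k_b, m):
    """Modularity change when clusters A and B are merged into one."""
    return l_ab / m - k_a * k_b / (2 * m ** 2)


def ring_modularity(n_cliques, clique_size, group=1):
    """Modularity of a ring of cliques joined by single edges.

    `group` consecutive cliques form one cluster. Assumes n_cliques >= 3
    and that `group` divides n_cliques.
    """
    if n_cliques < 3 or n_cliques % group != 0:
        raise ValueError("need n_cliques >= 3 and group dividing n_cliques")
    inside = clique_size * (clique_size - 1) // 2   # edges within one clique
    m = n_cliques * (inside + 1)                    # one ring edge per clique
    n_clusters = n_cliques // group
    if n_clusters == 1:
        return 0.0                                  # single cluster, Q = 0
    edges_in = group * inside + (group - 1)         # ring edges inside a group
    degree = group * (2 * inside + 2)               # total degree of a group
    return n_clusters * (edges_in / m - (degree / (2 * m)) ** 2)
```

For 30 cliques of 5 vertices, keeping the cliques separate gives a modularity of about 0.876, whereas merging them in pairs gives about 0.888, so the partition in pairs wins. With only 20 cliques of the same size the individual cliques are preferred, consistently with the threshold $n_c \sim l^2$.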

The conclusion is striking: modularity optimization 
has a resolution limit that may prevent it from detecting 
clusters which are comparatively small with respect to 
the graph as a whole, even when they are well defined 
communities like cliques. So, if the partition with maxi- 
mum modularity includes clusters with total degree of the 
order of $\sqrt{m}$ or smaller, one cannot know a priori whether
the clusters are single communities or combinations of 
smaller weakly interconnected communities. This resolu- 
tion problem has a large impact in practical applications. 



Real graphs with community structure usually contain 
communities which are very diverse in size (Clauset et al.,
2004; Danon et al., 2005; Guimera et al., 2003; Palla
et al., 2005), so many (small) communities may remain
undetected. Besides, modularity is extremely sensitive to 
even individual connections. Many real graphs, in biol- 
ogy and in the social sciences, are reconstructed through 
experiments and surveys, so edges may occasionally be 
false positives: if two small subgraphs happen to be con- 
nected by a few false edges, modularity will put them in 
the same cluster, inferring a relationship between entities 
that in reality may have nothing to do with each other. 

The resolution limit comes from the very definition of 
modularity, in particular from its null model. The weak 
point of the null model is the implicit assumption that 
each vertex can interact with every other vertex, which 
implies that each part of the graph knows about every- 
thing else. This is however questionable, and certainly 
wrong for large systems like, e.g., the Web graph. It 
is certainly more reasonable to assume that each vertex 
has a limited horizon within the graph, and interacts just 
with a portion of it. However, nobody knows yet how to 
define such local territories for the graph vertices. The 
null model of the localized modularity of Muff et al.
(Section VI.B) is a possibility, since it limits the horizon of
a vertex to a local neighborhood, comprising the cluster 
of the vertex and the clusters linked to it by at least one 
edge (neighboring clusters). However, there are many 
other possible choices. In this respect, the null model 
of Girvan and Newman, though unrealistic, is the sim- 
plest one can think of, which partly explains its success. 
Quality functions that, like modularity, are based on a 
null model such that the horizon of vertices is of the or- 
der of the size of the whole graph, are likely to be affected 
by a resolution limit (Fortunato, 2007). The problem is 
more general, though. For instance, Li et al. (Li et al., 
2008b) have introduced a quality function, called modu- 
larity density, which consists in the sum over the clusters 
of the ratio between the difference of the internal and 
external degrees of the cluster and the cluster size. The 
modularity density does not require a null model, and de- 
livers better results than modularity optimization (e. g. 
it correctly recovers the natural partition of the graph in 
Fig. 15 for any number/size of the cliques). However, it 
is still affected by a resolution limit. To avoid that, Li et 
al. proposed a more general definition of their measure, 
including a tunable parameter that allows one to explore the
graph at different resolutions, in the spirit of the methods 
of Section XII. 
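
The basic form of the modularity density can be sketched in a few lines. The code assumes that the internal degree of a cluster counts each internal edge twice (once per endpoint) and that the external degree counts the edges leaving the cluster; the tunable parameter of the generalized measure is not included, and the helper that builds the ring of cliques is hypothetical.

```python
def modularity_density(edges, clusters):
    """Sum over clusters of (internal degree - external degree) / size.

    edges: list of (u, v) pairs, each undirected edge listed once.
    clusters: list of sets of vertices forming a partition.
    """
    label = {v: idx for idx, members in enumerate(clusters) for v in members}
    internal = [0] * len(clusters)
    external = [0] * len(clusters)
    for u, v in edges:
        cu, cv = label[u], label[v]
        if cu == cv:
            internal[cu] += 2        # both endpoints lie in the cluster
        else:
            external[cu] += 1
            external[cv] += 1
    return sum((internal[c] - external[c]) / len(clusters[c])
               for c in range(len(clusters)))


def ring_of_cliques(n_cliques, clique_size):
    """Edge list of cliques arranged on a ring and joined by single edges."""
    edges = []
    for c in range(n_cliques):
        base = c * clique_size
        edges += [(base + i, base + j)
                  for i in range(clique_size)
                  for j in range(i + 1, clique_size)]
        nxt = ((c + 1) % n_cliques) * clique_size
        edges.append((base + clique_size - 1, nxt))   # edge to the next clique
    return edges
```

On a ring of 30 cliques of 5 vertices, the partition in individual cliques has modularity density 108, against 60 for the partition in clique pairs, so the natural partition is preferred in this example.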

A way to go around the resolution limit problem could 
be to perform further subdivisions of the clusters ob- 
tained from modularity optimization, in order to elim- 
inate possible artificial mergers of communities. For 
instance, one could recursively optimize modularity for 
each single cluster, taking the cluster as a separate en- 
tity (Fortunato and Barthelemy, 2007; Ruan and Zhang, 
2008). However, this is not a reliable procedure, for two 
reasons: 1) the local modularities used to find partitions
within the clusters have different null models, as they
depend on the cluster sizes, so they are inconsistent with
each other; 2) one needs to define a criterion to decide
when one has to stop partitioning a cluster, but there is
no obvious prescription, so any choice is necessarily based
on arbitrary assumptions (see the footnote on the stopping
criterion of Ruan and Zhang).

Resolution limits arise as well in the more general
formulation of community detection by Reichardt and
Bornholdt (Kumpula et al., 2007b). Here the limit scale for
the undetectable clusters is $\sqrt{\gamma m}$. We recall
that $\gamma$ weighs the contribution of the null model
term in the quality function. For $\gamma = 1$ one recovers
the resolution limit of modularity. By tuning the parameter
$\gamma$ it is possible to arbitrarily vary the resolution
scale of the corresponding quality function. This in
principle solves the problem of the resolution limit, as
one could adjust the resolution of the method to the actual
scale of the communities to detect. The problem is that
usually one has no information about the community sizes,
so it is not possible to decide a priori the proper
value(s) of $\gamma$ for a specific graph. In
the most recent literature on graph clustering quite a few 
multiresolution methods have been introduced, address- 
ing this problem in several ways. We will discuss them 
in some detail in Section XII. 

The resolution limit can be easily extended to the case 
of weighted graphs. In a recent paper (Berry et al.,
2009), Berry et al. have considered the special case in
which intracluster edges have weight 1, whereas
intercluster edges have weight $\epsilon$. By repeating the
same procedure as in Ref. (Fortunato and Barthelemy, 2007),
they conclude that clusters with internal strength (i. e.
sum of all weights of internal edges) $W_s$ may remain
undetected if $W_s < \sqrt{W \epsilon / 2} - \epsilon$,
where $W$ is the total strength of the graph. So, the
resolution limit decreases when $\epsilon$ decreases. Berry
et al. use this result to show that,
by properly weighting the edges of a given unweighted 
graph, it becomes possible to detect clusters with very 
high resolution by still using modularity optimization. 

Very recently, Good et al. (Good et al., 2009) have
made a careful analysis of modularity and its perfor- 
mance. They discovered that the modularity landscape 
is characterized by an exponential number of distinct 
states/partitions, whose modularity values are very close 
to the global maximum (Fig. 16). This problem is partic- 
ularly dramatic if a graph has a hierarchical community 
structure, like most real networks. Such enormous num- 



Ruan and Zhang (Ruan and Zhang, 2008) propose a stopping 
criterion based on the statistical significance of the maximum 
modularity values of the subgraph. The maximum modularity 
of a subgraph is compared with the expected maximum modu- 
larity for a random graph with the same size and expected de- 
gree sequence of the subgraph. If the corresponding z-score is 
sufficiently high, the subgraph is supposed to have community 
structure and one accepts the partition in smaller pieces. The 
procedure stops when none of the subgraphs of the running parti- 
tions has significant community structure, based on modularity. 




FIG. 16 Low-dimensional visualization of the modularity 
landscape for the metabolic network of the spirochete Tre- 
ponema pallidum. The big degeneracy of suboptimal high- 
modularity partitions is revealed by the plateau (whose shape 
is detailed in the inset), which is large and very irregular. 
Modularity values in the plateau are very close to the absolute 
maximum, although they may correspond to quite different 
partitions. Reprinted figure with permission from Ref. (Good 
et al., 2009). 



ber of solutions explains why many heuristics are able to 
come very close to modularity's global maximum, but it 
also implies that the global maximum is basically impos- 
sible to find. In addition, high-modularity partitions are 
not necessarily similar to each other, despite the proxim- 
ity of their modularity scores. The optimal partition from 
a topological point of view, which usually does not corre- 
spond to the modularity maximum due to the resolution 
limit, may however have a large modularity score. There- 
fore the optimal partition is basically indistinguishable 
from a huge number of high-modularity partitions, which 
are in general structurally dissimilar from it. The large 
structural inhomogeneity of the high-modularity parti- 
tions implies that one cannot rely on any of them, at 
least in principle, in the absence of additional informa- 
tion on the particular system at hand and its structure. 



VII. SPECTRAL ALGORITHMS 

In Sections IV. A and IV. D we have learned that spec- 
tral properties of graph matrices are frequently used 
to find partitions. A paradigmatic example is spectral 
graph clustering, which makes use of the eigenvectors of 
Laplacian matrices (Section IV. D). We have also seen 
that Newman-Girvan modularity can be optimized by 
using the eigenvectors of the modularity matrix (Sec- 
tion VI. A. 4). Most spectral methods have been intro- 
duced and developed in computer science and generally 
focus on data clustering, although applications to graphs
are often possible as well. In this section we shall review
recent spectral techniques proposed mostly by physicists 
explicitly for graph clustering. 

Early works have shown that the eigenvectors of the 
transfer matrix T (Section A. 2) can be used to extract 
useful information on community structure. The trans- 
fer matrix acts as a time propagator for the process of 
random walk on a graph. Given the eigenvector
$\mathbf{c}^\alpha$ of the transposed transfer matrix
$T^T$, corresponding to the eigenvalue $\lambda_\alpha$,
$c_i^\alpha$ is the outgoing current flowing from vertex
$i$, corresponding to the eigenmode $\alpha$. The
participation ratio (PR)

$$\chi_\alpha = \left[ \sum_{i=1}^{n} (c_i^\alpha)^4 \right]^{-1} \qquad (53)$$

indicates the effective number of vertices contributing to
eigenvector $\mathbf{c}^\alpha$. If $\chi_\alpha$ receives
contributions only from vertices of the same cluster, i. e.
eigenvector $\mathbf{c}^\alpha$ is "localized", the value
of $\chi_\alpha$ indicates the size of that cluster
(Eriksen et al., 2003; Simonsen et al., 2004). The
significance of the cluster can be assessed by comparing
$\chi_\alpha$
with the corresponding participation ratio for a random 
graph with the same expected degree sequence as the 
original graph. Eigenvectors of the adjacency matrix may 
be localized as well if the graph has a clear community 
structure (Slanina and Zhang, 2005). A recent compre- 
hensive analysis of spectral properties of modular graphs 
has been carried out by Mitrovic and Tadic (Mitrovic and 
Tadic, 2009). 
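
The meaning of the participation ratio as an effective number of vertices is easy to verify on toy vectors. The sketch below assumes that the eigenvector is normalized to unit Euclidean norm before the fourth powers are summed, a convention that the text does not state explicitly; the function name is hypothetical.

```python
from math import sqrt


def participation_ratio(vector):
    """Inverse of the sum of fourth powers of the normalized components."""
    norm = sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("the vector has zero norm")
    return 1.0 / sum((x / norm) ** 4 for x in vector)
```

A vector spread evenly over four vertices has participation ratio 4, while a vector localized on two of the four vertices has participation ratio 2.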

Donetti and Munoz have devised an elegant method based on
the eigenvectors of the Laplacian matrix (Donetti and
Munoz, 2004). The idea is the same as in spectral graph
clustering (Section IV.D): since the values of the
eigenvector components are close for vertices in the same
community, one can use them as coordinates, such that
vertices turn into points in a metric space. So, if one
uses $M$ eigenvectors, one can embed the vertices in an
$M$-dimensional space. Communities appear as groups of
points well separated from each other, as illustrated in
Fig. 17. The larger the number of dimensions/eigenvectors
$M$, the more visible the separation. The originality of
the method consists in the procedure to group the points
and to extract the partition. Donetti and Munoz used
hierarchical clustering (see Section IV.B), with the
constraint that only pairs of clusters which have at least
one interconnecting edge in the original graph are merged.
Among all partitions of the resulting dendrogram, the one
with largest modularity is chosen. For the similarity
measure between vertices, Donetti and Munoz used both the
Euclidean distance and the angle distance. The angle
distance between two points is the angle between the
vectors going from the origin of the $M$-dimensional space
to either point. Tests on the benchmark by Girvan
and Newman (Section XV. A) show that the best results 
are obtained with complete-linkage clustering. The most 
computationally expensive part of the algorithm is the 
calculation of the Laplacian eigenvectors. Since a few

FIG. 17 Spectral algorithm by Donetti and Munoz. Vertex
$i$ is represented by the values of the $i$th components
of Laplacian eigenvectors. In this example, the graph has
an ad-hoc division in four communities, indicated by the
colours. The communities are better separated in two
dimensions (b) than in one (a). Reprinted figure with
permission from Ref. (Donetti and Munoz, 2004). ©2004 by
IOP Publishing and SISSA.



eigenvectors suffice to get good partitions, one can de- 
termine them with the Lanczos method (Lanczos, 1950). 
The number $M$ of eigenvectors that are needed to have a
clean separation of the clusters is not known a priori, but
one can compute a number $M_0 > 1$ of them and search for
the highest modularity partition among those delivered by
the method for all $1 \le M \le M_0$. In a related
work, Simonsen has embedded graph vertices in space by 
using as coordinates the components of the eigenvectors 
of the right stochastic matrix (Simonsen, 2005). 
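
The two distances used by Donetti and Munoz act on the coordinates that the vertices receive from the eigenvector components. A minimal sketch is given below; it takes the coordinates as plain lists, leaves the computation of the eigenvectors aside, and uses hypothetical function names.

```python
from math import acos, sqrt


def euclidean_distance(u, v):
    """Euclidean distance between two points given as coordinate lists."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle_distance(u, v):
    """Angle (in radians) between the vectors from the origin to u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("the angle is undefined for a point at the origin")
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return acos(max(-1.0, min(1.0, dot / norm)))
```

The angle distance ignores how far the points lie from the origin, so two vertices whose coordinates differ only by a common scale factor are at angle distance zero even if their Euclidean distance is large.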

Eigenvalues and eigenvectors of the Laplacian matrix 
have been used by Alves to compute the effective con- 
ductances for pairs of vertices in a graph, assuming that 
the latter is an electric network with edges of unit re- 
sistance (Alves, 2007). The conductances enable one to 
compute the transition probabilities for a random walker 
moving on the graph, and from the transition proba- 
bilities one builds a similarity matrix between vertex 
pairs. Hierarchical clustering is applied to join vertices 
in groups. The method can be trivially extended to the 
case of weighted graphs. The algorithm by Alves is rather 
slow, as one needs to compute the whole spectrum of the 
Laplacian, which requires a time $O(n^3)$. Moreover, there
is no criterion to select which partition(s) of the dendro- 
gram is (are) the best. 

Capocci et al. (Capocci et al., 2005) used eigenvector
components of the right stochastic matrix $R$
(Section A.2), which is derived from the adjacency matrix
by dividing each row by the sum of its elements. The right
stochastic matrix has properties similar to those of the
Laplacian.
If the graph has g connected components, the largest g

FIG. 18 Basic principle of the spectral algorithm by Capocci
et al. (Capocci et al., 2005). The bottom diagram shows the
values of the components of the second eigenvector of the
right stochastic matrix for the graph drawn on the top. The
three plateaus of the eigenvector components correspond to
the three evident communities of the graph. Reprinted figures
with permission from Ref. (Capocci et al., 2005). ©2005 by
Elsevier.



eigenvalues are equal to 1, with eigenvectors characterized
by having equal-valued components for vertices belonging to
the same component. In this way, by listing the vertices
according to the connected components they belong to, the
components of any eigenvector of $R$, corresponding to
eigenvalue 1, display a step-wise profile, with plateaus
indicating vertices in the same connected component. For
connected graphs with cluster structure, one can still see
plateaus, if communities are only loosely connected to each
other (Fig. 18). Here the communities can be immediately
deduced by an inspection of the components of any
eigenvector with eigenvalue 1. In
practical cases, plateaus are not clearly visible, and one 
eigenvector is not enough. However, one expects that 
there should be a strong correlation between eigenvector 
components corresponding to vertices in the same clus- 
ter. Capocci et al. derived a similarity matrix, where the 
similarity between vertices i and j is the Pearson correla- 
tion coefficient between their corresponding eigenvector 



components, averaged over a small set of eigenvectors. 
The eigenvectors can be calculated by performing a con- 
strained optimization of a suitable cost function. The 
method can be extended to weighted and directed graphs. 
It is useful to estimate vertex similarities, however it does 
not provide a well-defined partition of the graph. 
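
The plateau property for disconnected graphs can be checked directly: a vector that is constant on each connected component is left unchanged by the right stochastic matrix. The sketch below builds $R$ from a dense adjacency matrix and applies it to a vector; it does not compute eigenvectors, and the function names are hypothetical.

```python
def right_stochastic(adj):
    """Divide each row of the adjacency matrix by the sum of its elements."""
    rows = []
    for row in adj:
        degree = sum(row)
        if degree == 0:
            raise ValueError("isolated vertices have no outgoing probability")
        rows.append([a / degree for a in row])
    return rows


def apply_matrix(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]
```

For two disconnected triangles, the vector with value 1 on the first triangle and 5 on the second is mapped onto itself, i. e. it is an eigenvector with eigenvalue 1 whose components show two plateaus.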

Yang and Liu (Yang and Liu, 2008) adopted a recursive 
bisectioning procedure. Communities are subgraphs such 
that the external degree of each vertex does not exceed 
the internal degree (strong communities or LS-sets, see
Section III.B.2). In the first step of the algorithm, the 
adjacency matrix of the graph is put in approximately 
block-diagonal form. This is done by computing a new 
centrality measure for the vertices, called clustering cen- 
trality. This measure is similar to Bonacich's eigenvector 
centrality (Bonacich, 1972, 1987), which is given by the 
eigenvector of the adjacency matrix corresponding to the 
largest eigenvalue. The clustering centrality of a vertex 
basically measures the probability that a random walker 
starting at that vertex hits a given target. Such proba- 
bility is larger if the origin and the target vertices belong 
to the same cluster than if they do not. If the graph has 
well-separated communities, the values of the clustering 
centrality would be similar for vertices in the same clus- 
ter. In this way, one can rearrange the original adjacency 
matrix by listing the vertices in non-decreasing order of 
their clustering centralities, and blocks would be visi- 
ble. The blocks are then identified by iterative bisection: 
each cluster found at some step is split in two as long as 
the resulting parts are still communities in the strong 
sense, otherwise the procedure stops. The worst-case 
complexity of the method is $O[Kt(n \log n + m)]$, where
K is the number of clusters of the final partition and t 
the (average) number of iterations required to compute 
the clustering centrality with the power method (Golub 
and Loan, 1989). Since t is fairly independent of the 
graph size, the method scales quite well on sparse graphs 
[$O(n \log n)$]. The main limitation of this technique is the
assumption that communities are defined in the strong 
sense, which is too restrictive. On the other hand, one 
could think of using alternative definitions. 
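
As an illustration of the stopping criterion used in the bisection, the following minimal sketch checks whether a vertex subset is a community in the strong sense, as stated above (no vertex has an external degree exceeding its internal degree). The function name `is_strong_community` and the toy graph are assumptions made for this example; they are not taken from Yang and Liu's paper.

```python
def is_strong_community(adj, members):
    """Return True if no vertex of `members` has more external than internal neighbours.

    `adj` maps each vertex to the set of its neighbours (undirected, unweighted graph).
    """
    members = set(members)
    for v in members:
        internal = len(adj[v] & members)  # neighbours inside the subgraph
        external = len(adj[v] - members)  # neighbours outside the subgraph
        if external > internal:
            return False
    return True


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

print(is_strong_community(adj, {0, 1, 2}))  # True
print(is_strong_community(adj, {2, 3}))     # False
```

In a recursive bisection, a split would be accepted only if both resulting parts pass this test.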



VIII. DYNAMIC ALGORITHMS 

This Section describes methods employing processes 
running on the graph, focusing on spin-spin interactions, 
random walks and synchronization. 



A. Spin models 

The Potts model is among the most popular models in 
statistical mechanics (Wu, 1982). It describes a system 
of spins that can be in q different states. The interaction 
is ferromagnetic, i. e. it favours spin alignment, so at 
zero temperature all spins are in the same state. If an- 
tiferromagnetic interactions are also present, the ground 
state of the system may not be the one where all spins
are aligned, but a state where different spin values co- 
exist, in homogeneous clusters. If Potts spin variables 
are assigned to the vertices of a graph with community 
structure, and the interactions are between neighbour- 
ing spins, it is likely that the structural clusters could 
be recovered from like-valued spin clusters of the system,
as there are many more interactions inside communities
than outside. Based on this idea, inspired by
an earlier paper by Blatt et al. (Blatt et al., 1996), Re- 
ichardt and Bornholdt proposed a method to detect com- 
munities that maps the graph onto a zero-temperature 
q-Potts model with nearest-neighbour interactions (Re- 
ichardt and Bornholdt, 2004). The Hamiltonian of the 
model, i. e. its energy, reads 

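$$\mathcal{H} = -J \sum_{ij} A_{ij}\,\delta(\sigma_i,\sigma_j) + \gamma \sum_{s=1}^{q} \frac{n_s(n_s-1)}{2}, \qquad (54)$$

where $A_{ij}$ is the element of the adjacency matrix, $\delta$ is
Kronecker's function, $n_s$ the number of spins in state $s$, and
$J$ and $\gamma$ are coupling parameters. The energy $\mathcal{H}$ is the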
sum of two competing terms: the first is the classical 
ferromagnetic Potts model energy, and favors spin align- 
ment; the second term instead peaks when the spins are 
homogeneously distributed. The ratio $\gamma/J$ expresses the
relative importance of the two terms: by tuning $\gamma/J$ one
can explore different levels of modularity of the system,
from the whole graph seen as a single cluster to clusters
consisting of individual vertices. If $\gamma/J$ is set to the value
$\delta(\mathcal{G})$ of the average density of edges of the graph $\mathcal{G}$, the
energy of the system is smaller if spins align within subgraphs
such that their internal edge density exceeds $\delta(\mathcal{G})$,
whereas the external edge density is smaller than $\delta(\mathcal{G})$,
i. e. if the subgraphs are clusters (Section III.B.1). The
minimization of $\mathcal{H}$ is carried out via simulated annealing
(Kirkpatrick et al., 1983; see also Section VI.A.2), starting
from a configuration where spins are randomly assigned 
to the vertices and the number of states q is very high. 
The procedure is quite fast and the results do not de- 
pend on q (provided q is sufficiently high). The method 
also allows one to identify vertices shared between communities,
from the comparison of partitions corresponding to
global and local energy minima. The Hamiltonian $\mathcal{H}$ can
be rewritten as 

$$\mathcal{H} = \sum_{i<j} \delta(\sigma_i,\sigma_j)\,(\gamma - A_{ij}), \qquad (55)$$

which is the energy of an infinite-range Potts spin glass, 
as all pairs of spins are interacting (neighboring or not) 
and there may be both positive and negative couplings. 
The method can be simply extended to the analysis of 
weighted graphs, by introducing spin couplings propor- 
tional to the edge weights, which amounts to replacing 
the adjacency matrix A with the weight matrix W in 
Eq. 54. Ispolatov et al. (Ispolatov et al., 2006) adopted
a Hamiltonian similar to that of Eq. 54, with a tunable
antiferromagnetic term interpolating between the
corresponding term of Eq. 54 and the entropy term (proportional
to $n_s \log n_s$) of the free energy, whose minimization
is equivalent to finding the states of the finite-temperature
Potts model used by Blatt et al. (Blatt et al.,
1996). Eq. 55 is at the basis of the subsequent generalization
of modularity with arbitrary null models proposed
by Reichardt and Bornholdt, which we discussed in
Section VI.B.
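
The relation between Eqs. 54 and 55 can be checked numerically on a small graph. The sketch below is only illustrative and rests on two assumptions: the first sum of Eq. 54 is taken over edges (unordered pairs of vertices), and $J = 1$ when comparing with Eq. 55. The function names and the toy graph are hypothetical.

```python
from itertools import combinations


def potts_energy(edges, spins, J=1.0, gamma=0.5):
    """Eq. 54, with the first sum taken over edges (unordered pairs): an assumption."""
    aligned_edges = sum(1 for i, j in edges if spins[i] == spins[j])
    counts = {}
    for s in spins:
        counts[s] = counts.get(s, 0) + 1
    # n_s (n_s - 1) / 2 is the number of vertex pairs sharing the spin state s.
    pair_term = sum(n_s * (n_s - 1) / 2 for n_s in counts.values())
    return -J * aligned_edges + gamma * pair_term


def spin_glass_energy(edges, spins, gamma=0.5):
    """Eq. 55: sum over all pairs i < j of delta(s_i, s_j) * (gamma - A_ij)."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for i, j in combinations(range(len(spins)), 2):
        if spins[i] == spins[j]:
            a_ij = 1 if frozenset((i, j)) in edge_set else 0
            total += gamma - a_ij
    return total


# Two triangles joined by one edge; gamma is set to the average edge density.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
density = len(edges) / (n * (n - 1) / 2)

two_clusters = [0, 0, 0, 1, 1, 1]
one_cluster = [0] * n
e_two = potts_energy(edges, two_clusters, gamma=density)
e_one = potts_energy(edges, one_cluster, gamma=density)
```

With $\gamma/J$ equal to the edge density, the partition into the two triangles has lower energy than the single cluster, in line with the argument in the text; the two energy functions agree on every spin configuration under the stated assumptions.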

In another work (S.-W. Son et al., 2006), Son et al.
have presented a clustering technique based on the Ferromagnetic
Random Field Ising Model (FRFIM). Given a
weighted graph with weight matrix W, the Hamiltonian 
of the FRFIM on the graph is 

$$\mathcal{H} = -\frac{1}{2} \sum_{i,j} W_{ij}\,\sigma_i \sigma_j - \sum_i B_i \sigma_i. \qquad (56)$$

In Eq. 56, $\sigma_i = \pm 1$ and $B_i$ are the spin and the random
magnetic field of vertex $i$, respectively. The FRFIM
has been studied to understand the nature of the spin 
glass phase transition (Middleton and Fisher, 2002) and
the disorder-driven roughening transition of interfaces in 
disordered media (Noh and Rieger, 2001, 2002). The 
behavior of the model depends on the choice of the magnetic
fields. Son et al. set to zero the magnetic fields
of all vertices but two, say $s$ and $t$, for which the field
has infinite strength and opposite signs. This amounts
to fixing the spins of $s$ and $t$ to opposite values, introducing
frustration in the system. The idea is that, if $s$ and $t$
are central vertices of different communities, they impose
their spin state on the other community members. So,
the state of minimum energy is a configuration in which 
the graph is polarized into a subgraph with all positive 
spins and a subgraph with all negative spins, coinciding 
with the communities, if they are well defined. Finding 
the minimum of H is equivalent to solving a maximum- 
flow/minimum-cut problem, which can be done through 
well known techniques of combinatorial optimization, like 
the augmenting path algorithm (Ahuja et al., 1993). For
a given choice of s and t, many ground states can be 
found. The vertices that end up in the same cluster in 
all ground states represent the cores of the clusters, which 
are called coteries. Possible vertices not belonging to the 
coteries indicate that the two clusters overlap. In the 
absence of information about the cluster structure of the 
graph, one needs to repeat the procedure for any pair 
of vertices s and t. Picking vertices of the same cluster, 
for instance, would not give meaningful partitions. Son 
et al. distinguish relevant clusters if they are of about 
the same size. The procedure can be iteratively applied 
to each of the detected clusters, considered as a separate 
graph, until all clusters have no community structure any 
more. On sparse graphs, the algorithm has complexity
$O(n^{2+\theta})$, where $\theta \sim 1.2$, so it is very slow and can
currently be used for graphs of up to a few thousand vertices. If
one happens to know which are the important vertices
of the clusters, e.g. by computing appropriate centrality
values (like degree or site betweenness (Freeman, 1977)),
the choices for $s$ and $t$ are constrained and the complexity
can become as low as $O(n^{\theta})$, which enables the analysis
of systems with millions of vertices. Tests on Barabási-Albert
graphs (Section A.3) show that the latter have no
community structure, as expected.
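
Since the ground state for a fixed pair $s$, $t$ corresponds to a minimum cut separating the two vertices, the bipartition can be sketched with a plain augmenting-path (breadth-first) maximum-flow routine. The code below is a minimal illustration under that assumption, not the implementation of Son et al.; it returns a single minimum cut and does not enumerate all ground states, so it does not compute coteries. The function name and the toy graph are hypothetical.

```python
from collections import deque


def min_cut_bipartition(n, weighted_edges, s, t):
    """Split vertices 0..n-1 into the s-side and t-side of one minimum s-t cut."""
    # Residual capacities; an undirected edge has capacity w in both directions.
    cap = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        cap[i][j] += w
        cap[j][i] += w

    def bfs_parents():
        # Breadth-first search from s along edges with residual capacity left.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    cut_value = 0.0
    while True:
        parent = bfs_parents()
        if parent[t] == -1:  # no augmenting path left: the flow is maximal
            break
        # Find the bottleneck capacity along the path, then push flow through it.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        cut_value += bottleneck

    # Vertices still reachable from s in the residual graph form the s-side.
    s_side = {v for v in range(n) if parent[v] != -1}
    return cut_value, s_side, set(range(n)) - s_side


# Two triangles joined by one edge of weight 1; s and t lie in different triangles.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 1.0)]
cut, side_s, side_t = min_cut_bipartition(6, edges, s=0, t=5)
print(cut, sorted(side_s), sorted(side_t))  # 1.0 [0, 1, 2] [3, 4, 5]
```

Repeating the call over many pairs $(s, t)$ is what makes the full procedure expensive, as discussed above.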



B. Random walk 

Random walks (Hughes, 1995) can also be useful to 
find communities. If a graph has a strong community 
structure, a random walker spends a long time inside a 
community due to the high density of internal edges and 
consequent number of paths that could be followed. Here 
we describe the most popular clustering algorithms based 
on random walks. All of them can be trivially extended 
to the case of weighted graphs. 

Zhou used random walks to define a distance between 
pairs of vertices (Zhou, 2003a): the distance $d_{ij}$ between
i and j is the average number of edges that a random 
walker has to cross to reach j starting from i. Close 
vertices are likely to belong to the same community. 
Zhou defines a "global attractor" of a vertex i to be a 
closest vertex to i (i. e. any vertex lying at the smallest 
distance from i), whereas the "local attractor" of i is 
its closest neighbour. Two types of communities are 
defined, according to local or global attractors: a vertex 
i has to be put in the same community as its attractor
and of all other vertices for which i is an attractor. 
Communities must be minimal subgraphs, i. e. they 
cannot include smaller subgraphs which are communities 
according to the chosen criterion. Applications to real 
networks, like Zachary's karate club (Zachary, 1977) and
the college football network compiled by Girvan and
Newman (Girvan and Newman, 2002) (Section XV.A),
along with artificial graphs like the benchmark by
Girvan and Newman (Girvan and Newman, 2002) (Section XV.A),
show that the method can find meaningful
partitions. The method can be refined, in that vertex $i$
is associated to its attractor $j$ only with a probability
proportional to $\exp(-\beta d_{ij})$, $\beta$ being a sort of inverse
temperature. The computation of the distance matrix
requires solving $n$ linear-algebra equations (as many
as the vertices), which requires a time $O(n^3)$. On
the other hand, an exact computation of the distance 
matrix is not necessary, as the attractors of a vertex 
can be identified by considering only a localized portion 
of the graph around the vertex; therefore the method 
can be applied to large graphs as well. In a subsequent
paper (Zhou, 2003b), Zhou introduced a measure of 
dissimilarity between vertices based on the distance 
defined above. The measure resembles the definition 
of distance based on structural equivalence of Eq. 7, 
where the elements of the adjacency matrix are replaced 
by the corresponding distances. Graph partitions are 
obtained with a divisive procedure that, starting from 
the graph as a single community, performs successive 
splits based on the criterion that vertices in the same 
cluster must be less dissimilar than a running threshold,
which is decreased during the process. The hierarchy
of partitions derived by the method is representative of
actual community structures for several real and artificial
graphs, including Zachary's karate club (Zachary,
1977), the college football network (Girvan and
Newman, 2002) and the benchmark by Girvan and
Newman (Girvan and Newman, 2002) (Section XV.A).
The time complexity of the procedure is again $O(n^3)$.
The code of the algorithm can be downloaded from
http://www.mpikg-golm.mpg.de/theory/people/zhou/networkcommunity.html.
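
The random-walk distance used by Zhou can be illustrated on a toy graph. The sketch below computes the average number of steps needed to reach a target vertex by iterating the relation $d_i = 1 + \sum_k P_{ik} d_k$ (with $d = 0$ at the target) instead of solving the linear system exactly; this iterative scheme, the function names and the example graph are assumptions made for illustration, and a connected graph is assumed.

```python
def mean_first_passage(adj, target, tol=1e-12, max_iter=100000):
    """Average number of steps a random walker needs to reach `target` from each vertex.

    Fixed-point iteration of d_i = 1 + (1/k_i) * sum of d over the neighbours of i,
    with d_target = 0. Assumes a connected, undirected, unweighted graph.
    """
    d = {v: 0.0 for v in adj}
    for _ in range(max_iter):
        new_d = {}
        for v in adj:
            if v == target:
                new_d[v] = 0.0
            else:
                new_d[v] = 1.0 + sum(d[u] for u in adj[v]) / len(adj[v])
        if max(abs(new_d[v] - d[v]) for v in adj) < tol:
            return new_d
        d = new_d
    return d


def global_attractor(adj, i):
    """A closest vertex to i with respect to the random-walk distance."""
    others = [j for j in adj if j != i]
    return min(others, key=lambda j: mean_first_passage(adj, j)[i])


# Path graph 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
d_to_2 = mean_first_passage(adj, target=2)
print(round(d_to_2[0], 6), round(d_to_2[1], 6))  # 4.0 3.0
print(global_attractor(adj, 0))                  # 1
```

On this path the walker started at vertex 0 needs on average four steps to reach vertex 2, but only one to reach vertex 1, which is therefore its attractor.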

In another work (Zhou and Lipowsky, 2004), Zhou and 
Lipowsky adopted biased random walkers, where the bias 
is due to the fact that walkers move preferentially towards 
vertices sharing a large number of neighbours with the 
starting vertex. They defined a proximity index, which 
indicates how close a pair of vertices is to all other ver- 
tices. Communities are detected with a procedure called 
NetWalk, which is an agglomerative hierarchical clustering
method (Section IV.B), where the similarity between
vertices is expressed by their proximity. The method has
a time complexity $O(n^3)$: however, the proximity index
of a pair of vertices can be computed with good approx- 
imation by considering just a small portion of the graph 
around the two vertices, with a considerable gain in time. 
The performance of the method is comparable with that 
of the algorithm of Girvan and Newman (Section V.A). 

A different distance measure between vertices based on 
random walks was introduced by Latapy and Pons (Lat- 
apy and Pons, 2005). The distance is calculated from 
the probabilities that the random walker moves from 
a vertex to another in a fixed number of steps. The 
number of steps has to be large enough to explore 
a significant portion of the graph, but not too long, 
as otherwise one would approach the stationary limit 
in which transition probabilities trivially depend on 
the vertex degrees. Vertices are then grouped into 
communities through an agglomerative hierarchical 
clustering technique based on Ward's method (Ward, 
1963). Modularity (Section III.C.2) is used to select the 
best partition of the resulting dendrogram. The algorithm
runs to completion in a time $O(n^2 d)$ on a sparse
graph, where $d$ is the depth of the dendrogram. Since
$d$ is often small for real graphs [$O(\log n)$], the expected
complexity in practical computations is $O(n^2 \log n)$.
The software of the algorithm can be found at
http://www-rp.lip6.fr/~latapy/PP/walktrap.html.
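
To make the idea of a distance built from fixed-step transition probabilities concrete, here is a small sketch. The text above does not give the exact formula, so the degree-normalised Euclidean distance between the $t$-step distributions used below is an assumption chosen for illustration, as are the function names and the value $t = 3$.

```python
import math


def t_step_probabilities(adj, start, t):
    """Probability of finding the walker at each vertex after t steps from `start`."""
    prob = {v: 0.0 for v in adj}
    prob[start] = 1.0
    for _ in range(t):
        nxt = {v: 0.0 for v in adj}
        for v, p in prob.items():
            if p:
                share = p / len(adj[v])  # the walker picks a neighbour uniformly
                for u in adj[v]:
                    nxt[u] += share
        prob = nxt
    return prob


def walk_distance(adj, i, j, t=3):
    """Degree-normalised Euclidean distance between the t-step distributions of i and j.

    The normalisation by the degree is an assumption of this sketch.
    """
    p_i = t_step_probabilities(adj, i, t)
    p_j = t_step_probabilities(adj, j, t)
    return math.sqrt(sum((p_i[k] - p_j[k]) ** 2 / len(adj[k]) for k in adj))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: [] for v in range(6)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)

d_same = walk_distance(adj, 0, 1)   # vertices of the same triangle
d_diff = walk_distance(adj, 0, 5)   # vertices of different triangles
```

For a short walk, two vertices in the same triangle see the graph in nearly the same way, so `d_same` is smaller than `d_diff`; an agglomerative procedure would merge such vertices first.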

Hu et al. (Hu et al., 2008) designed a graph clustering
technique based on a signaling process between vertices,
somewhat resembling diffusion. Initially a vertex $s$ is assigned
one unit of signal, all the others have no signal. In
the first step, the source vertex $s$ sends one unit of signal
to each of its neighbors. Next, all vertices send as many
units of signal as they have to each of their neighbors. The
process is continued until a given number of iterations $T$
is reached. The intensity of the signal at vertex $j$, normalized
by the total amount of signal, is the $j$-th entry of
a vector $\mathbf{u}_s$, representing the source vertex $s$. The procedure
is then repeated by choosing each vertex as source.
In this way one can associate an $n$-dimensional vector to
each vertex, which corresponds to a point in a Euclidean
space. The vector $\mathbf{u}_s$ is actually the $s$-th column of the
matrix $(\mathbf{I} + \mathbf{A})^T$ (the $T$-th power of $\mathbf{I} + \mathbf{A}$), where $\mathbf{I}$ and $\mathbf{A}$ are the identity and
adjacency matrix, respectively. The idea is that the vector
$\mathbf{u}_s$ describes the influence that vertex $s$ exerts on the
graph through signaling. Vertices of the same community
are expected to have similar influence on the graph
and thus to correspond to vectors which are "close" in
space. The vectors are finally grouped via fuzzy $k$-means
clustering (Section IV.C). The optimal number of clusters
corresponds to the partition with the shortest average
distance between vectors in the same community and
the largest average distance between vectors of different
communities. The signaling process is similar to diffusion,
but with the important difference that here there is
no flow conservation, as the amount of signal at each vertex
is not distributed among its neighbors but transferred
entirely to each neighbor (as if the vertex sent multiple
copies of the same signal). The complexity of the algorithm
is $O[T(\langle k \rangle + 1)n^2]$, where $\langle k \rangle$ is the average degree
of the graph. Like in the previous algorithm by Latapy
and Pons (Latapy and Pons, 2005), finding an optimal
value for the number of iterations $T$ is non-trivial.
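
The signaling process can be written in a few lines. The sketch below propagates the signal for $T$ steps, each vertex keeping its own signal and sending a full copy to every neighbour, which reproduces the column of the $T$-th power of the identity plus adjacency matrix mentioned above. The function name and the toy graph are assumptions for illustration; the final fuzzy $k$-means step is not included.

```python
def signal_vector(adj, source, T):
    """Normalised signal after T steps, i.e. the source's column of (I + A) to the power T."""
    signal = {v: 0 for v in adj}
    signal[source] = 1
    for _ in range(T):
        # Every vertex keeps its signal and sends a full copy to each neighbour,
        # so the total amount of signal is not conserved.
        nxt = dict(signal)
        for v, amount in signal.items():
            for u in adj[v]:
                nxt[u] += amount
        signal = nxt
    total = sum(signal.values())
    return [signal[v] / total for v in sorted(adj)]


# Path graph 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(signal_vector(adj, source=0, T=1))  # [0.5, 0.5, 0.0]
print(signal_vector(adj, source=0, T=2))  # [0.4, 0.4, 0.2]
```

Computing one such vector per source vertex yields the $n$ points that are subsequently clustered.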

Delvenne et al. (Delvenne et al., 2008) have shown that
random walks enable one to introduce a general quality
function, expressing the persistence of clusters in time. A
cluster is persistent with respect to a random walk after
$t$ time steps if the probability that the walker escapes the
cluster before $t$ steps is low. Such probability is computed
via the clustered autocovariance matrix $R_t$, which, for a
partition of the graph in $c$ clusters, is defined as

$$R_t = H^T (\Pi M^t - \pi^T \pi) H. \qquad (57)$$

Here, $H$ is the $n \times c$ membership matrix, whose element
$H_{ij}$ equals one if vertex $i$ is in cluster $j$, zero otherwise;
$M$ is the transition matrix of the random walk; $\Pi$ the
diagonal matrix whose elements are the stationary probabilities
of the random walk, i. e. $\Pi_{ii} = k_i/2m$, $k_i$ being
the degree of vertex $i$; $\pi$ is the vector whose entries are
the diagonal elements of $\Pi$. The element $(R_t)_{ij}$ expresses
the probability for the walk to start in cluster $i$ and end
up in cluster $j$ after $t$ steps, minus the stationary probability
that two independent random walkers are in $i$ and
$j$. In this way, the persistence of a cluster $i$ is related to
the diagonal element $(R_t)_{ii}$. Delvenne et al. defined the
stability of the clustering

$$r(t; H) = \min_{0 \le s \le t} \sum_{i=1}^{c} (R_s)_{ii} = \min_{0 \le s \le t} \mathrm{trace}[R_s]. \qquad (58)$$

The aim is then, for a given time $t$, finding the partition
with the largest value for $r(t; H)$. For $t = 0$, the most
stable partition is that in which all vertices are their own
clusters. Interestingly, for $t = 1$, maximizing stability
is equivalent to maximizing Newman-Girvan modularity
(Section III.C.2). The cut size of the partition (Section IV.A)
equals $[r(0) - r(1)]$, so it is also a one-step
measure. In the limit $t \to \infty$, the most stable partition
coincides with the Fiedler partition (Fiedler, 1973, 1975),
i. e. the bipartition where vertices are put in the same
class according to the signs of the corresponding component
of the Fiedler eigenvector (Section IV.A). Therefore,
the measure $r(t; H)$ is very general, and gives a unifying
interpretation in the framework of the random walk
of several measures that were defined in different contexts.
In particular, modularity has a natural interpretation
in this dynamic picture (Lambiotte et al., 2008).
Since the size of stable clusters increases with $t$, time
can be considered as a resolution parameter. Resolution
can be fine tuned by taking time as a continuous variable
(the extension of the formalism is straightforward);
the linearization of the stability measure at small (continuous)
times delivers multiresolution versions of modularity
(Arenas et al., 2008b; Reichardt and Bornholdt,
2006a) (Section XII.A).
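
The link between the trace of $R_t$ and modularity can be verified numerically. The following sketch evaluates $\mathrm{trace}[R_t]$ of Eq. 57 directly from its definition for an undirected, unweighted graph and compares the value at $t = 1$ with Newman-Girvan modularity computed independently. The function names and the toy graph are assumptions made for this illustration, and only integer times are considered.

```python
def stability_trace(adj, labels, t):
    """trace(R_t) of Eq. 57 for an undirected, unweighted graph."""
    nodes = sorted(adj)
    two_m = sum(len(adj[v]) for v in nodes)
    pi = {v: len(adj[v]) / two_m for v in nodes}  # stationary probabilities k_i / 2m
    total = 0.0
    for i in nodes:
        # Distribution of a walker started at i after t steps (row i of M^t).
        prob = {v: 0.0 for v in nodes}
        prob[i] = 1.0
        for _ in range(t):
            nxt = {v: 0.0 for v in nodes}
            for v, p in prob.items():
                for u in adj[v]:
                    nxt[u] += p / len(adj[v])
            prob = nxt
        for j in nodes:
            if labels[i] == labels[j]:
                total += pi[i] * prob[j] - pi[i] * pi[j]
    return total


def stability(adj, labels, t):
    """Eq. 58: minimum of trace(R_s) over integer times 0 <= s <= t."""
    return min(stability_trace(adj, labels, s) for s in range(t + 1))


def modularity(adj, labels):
    """Newman-Girvan modularity, computed directly from its definition."""
    nodes = sorted(adj)
    two_m = sum(len(adj[v]) for v in nodes)
    q = 0.0
    for i in nodes:
        for j in nodes:
            if labels[i] == labels[j]:
                a_ij = 1 if j in adj[i] else 0
                q += a_ij / two_m - len(adj[i]) * len(adj[j]) / two_m ** 2
    return q


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

trace_0 = stability_trace(adj, labels, 0)
trace_1 = stability_trace(adj, labels, 1)
q = modularity(adj, labels)
```

On this example `trace_1` and `q` coincide up to rounding, while `trace_0` equals 0.5.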

In a method by Weinan et al. (Weinan et al., 2008), 
the best partition of a graph in k clusters is such that the 
Markov chain describing a random walk on the meta- 
graph, whose vertices are the clusters of the original 
graph, gives the best approximation of the full random 
walk dynamics on the whole graph. The quality of the 
approximation is given by the distance between the left 
stochastic matrices of the two processes, which thus needs 
to be minimized. The minimization is performed by using
a variant of the $k$-means algorithm (Section IV.C), and
the result is the best obtained out of $l$ runs starting from
different initial conditions, a strategy that considerably
improves the quality of the optimum. The time complexity
is $O[tlk(n + m)]$, where $t$ is the number of steps
required to reach convergence. The optimal number of 
clusters could in principle be determined by analyzing 
how the quality of the approximation varies with k, but 
the authors do not give any general recipe. The method 
is rather accurate on the benchmark by Girvan and Newman
(Girvan and Newman, 2002) (Section XV.A) and on
Zachary's karate club network. The algorithm by Weinan
et al. is asymptotically equivalent to spectral graph partitioning
(Section IV.D) when the Markov chain describing
the random walk presents a sizeable spectral gap between
some of the largest eigenvalues of the transfer matrix
(Section A.2), approximately equal to one, and the
others. 

We conclude this section by describing the Markov 
Cluster Algorithm (MCL), which was invented by Van 
Dongen (Dongen, 2000a). This method simulates a pe- 
culiar process of flow diffusion in a graph. One starts 
from the transfer matrix of the graph $T$ (Section A.2).
The element $T_{ij}$ of the transfer matrix gives the probability
that a random walker, sitting at vertex $j$, moves to
$i$. The sum of the elements of each column of $T$ is one.
Each iteration of the algorithm consists of two steps. In
the first step, called expansion, the transfer matrix of the
graph is raised to an integer power $p$ (usually $p = 2$). The
entry $M_{ij}$ of the resulting matrix gives the probability
that a random walker, starting from vertex $j$, reaches $i$
in $p$ steps (diffusion flow). The second step, which has no
physical counterpart, consists in raising each single entry
of the matrix $M$ to some power $\alpha$, where $\alpha$ is now real-valued.
This operation, called inflation, enhances the
weights between pairs of vertices with large values of the 
diffusion flow, which are likely to be in the same commu- 
nity. Next, the elements of each column must be divided 
by their sum, such that the sum of the elements of the 
column equals one and a new transfer matrix is recov- 
ered. After some iterations, the process delivers a stable 
matrix, with some remarkable properties. Its elements 
are either zero or one, so it is a sort of adjacency matrix. 
Most importantly, the graph described by the matrix is 
disconnected, and its connected components are the communities
of the original graph. The method is really simple
to implement, which is the main reason for its success:
as of now, the MCL is one of the most used clustering algorithms
in bioinformatics. The code can be downloaded
from http://www.micans.org/mcl/. Due to the matrix
multiplication of the expansion step, the algorithm
should scale as $O(n^3)$, even if the graph is sparse, as the
running matrix becomes quickly dense after a few steps
of the algorithm. However, while computing the matrix
multiplication, MCL keeps only a maximum number $k$ of
non-zero elements per column, where $k$ is usually much
smaller than $n$. So, the actual worst-case running time of
the algorithm is $O(nk^2)$ on a sparse graph. A problem of
the method is the fact that the final partition is sensitive
to the parameter $\alpha$ used in the inflation step. Therefore
several different partitions can be obtained, and it is not
clear which are the most meaningful or representative.
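
The expansion and inflation steps can be sketched with dense matrices as follows. This is a toy illustration, not Van Dongen's implementation: it adds a self-loop to every vertex before normalising (an assumption, commonly made to avoid oscillations), it does not prune small entries, and the helper names, tolerances and example graph are hypothetical.

```python
def normalise_columns(M):
    """Divide each column by its sum, so that every column sums to one."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adj, inflation=2.0, expansion=2, max_iter=100, tol=1e-9):
    """Toy Markov Cluster Algorithm; returns the clusters as sorted lists of vertices."""
    nodes = sorted(adj)
    n = len(nodes)
    index = {v: k for k, v in enumerate(nodes)}
    # Column-stochastic transfer matrix; the self-loops are an added assumption.
    M = [[0.0] * n for _ in range(n)]
    for v in nodes:
        M[index[v]][index[v]] = 1.0
        for u in adj[v]:
            M[index[u]][index[v]] = 1.0
    M = normalise_columns(M)

    for _ in range(max_iter):
        # Expansion: raise the current matrix to the integer power `expansion`.
        E = M
        for _ in range(expansion - 1):
            E = matmul(E, M)
        # Inflation: entry-wise power followed by column normalisation.
        new_M = normalise_columns([[x ** inflation for x in row] for row in E])
        change = max(abs(new_M[i][j] - M[i][j]) for i in range(n) for j in range(n))
        M = new_M
        if change < tol:
            break

    # Clusters are the connected components of the graph defined by non-zero entries.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if M[i][j] > 1e-6:
                parent[find(i)] = find(j)
    groups = {}
    for k, v in enumerate(nodes):
        groups.setdefault(find(k), []).append(v)
    return sorted(groups.values())


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {v: [] for v in range(6)}
for a, b in edges:
    adj[a].append(b)
    adj[b].append(a)
clusters = mcl(adj)
```

Changing the `inflation` argument illustrates the sensitivity to $\alpha$ mentioned in the text: larger values tend to produce more, smaller clusters.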



C. Synchronization 

Synchronization (Pikovsky et al., 2001) is an emergent 
phenomenon occurring in systems of interacting units 
and is ubiquitous in nature, society and technology. In 
a synchronized state, the units of the system are in the 
same or similar state(s) at every time. Synchronization 
has also been applied to find communities in graphs. If 
oscillators are placed at the vertices, with initial random 
phases, and have nearest-neighbour interactions, oscilla- 
tors in the same community synchronize first, whereas 
a full synchronization requires a longer time. So, if one 
follows the time evolution of the process, states with syn- 
chronized clusters of vertices can be quite stable and long- 
lived, so they can be easily recognized. This was first 
shown by Arenas, Díaz-Guilera and Pérez-Vicente (Arenas
et al., 2006). They used Kuramoto oscillators (Kuramoto,
1984), which are coupled two-dimensional vectors
endowed with a proper frequency of oscillations. In
the Kuramoto model, the phase $\theta_i$ of oscillator $i$ evolves
according to the following dynamics

$$\frac{d\theta_i}{dt} = \omega_i + \sum_j K \sin(\theta_j - \theta_i), \qquad (59)$$

where $\omega_i$ is the natural frequency of $i$, $K$ the strength of
the coupling between oscillators and the sum runs over
all oscillators (mean field regime). If the interaction coupling
exceeds a threshold, depending on the width of the
distribution of natural frequencies, the dynamics leads to
synchronization. If the dynamics runs on a graph, each
oscillator is coupled only to its nearest neighbors. In order
to reveal the effect of local synchronization, Arenas
et al. introduced the local order parameter

$$\rho_{ij}(t) = \langle \cos[\theta_i(t) - \theta_j(t)] \rangle, \qquad (60)$$

measuring the average correlation between oscillators $i$
and $j$. The average is computed over different initial conditions.
By visualizing the correlation matrix $\rho(t)$ at a
given time $t$, one may distinguish groups of vertices that
synchronize together. The groups can be identified by
means of the dynamic connectivity matrix $\mathcal{D}_t(T)$, which
is a binary matrix obtained from $\rho(t)$ by thresholding its
entries. The dynamic connectivity matrix embodies information
about both the synchronization dynamics and
the underlying graph topology. From the spectrum of
$\mathcal{D}_t(T)$ it is possible to derive the number of disconnected
components at time t. By plotting the number of compo- 
nents as a function of time, plateaus may appear at some 
characteristic time scales, indicating structural scales of 
the graph with robust communities (Fig. 19). Partitions 
corresponding to long plateaus are characterized by high 
values of the modularity of Newman and Girvan (Sec- 
tion III.C.2) on graphs with homogeneous degree distri- 
butions, whereas such correlation is poor in the presence 
of hubs (Arenas and Díaz-Guilera, 2007). Indeed, it has
been proven that the stability (Eq. 58) of the dynamics 
associated to the standard Laplacian matrix, which de- 
scribes the convergence towards synchronization of the 
Kuramoto model with equal intrinsic frequencies, coin- 
cides with modularity only for graphs whose vertices have 
the same degree (Lambiotte et al., 2008). The appearance
of plateaus at different time scales hints at a hierarchical
organization of the graph. After a sufficiently long
t all oscillators are synchronized and the whole system be- 
haves as a single component. Interestingly, Arenas et al. 
found that the structural scales revealed by synchroniza- 
tion correspond to groups of eigenvalues of the Laplacian 
matrix of the graph, separated by gaps (Fig. 19). 
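
The ingredients of this analysis can be sketched in code. The functions below implement one explicit Euler step of Eq. 59 with the sum restricted to graph neighbours, the local order parameter of Eq. 60 averaged over a list of runs, and a thresholded binary matrix. The Euler scheme, the step size, the function names and the way runs are stored are assumptions made for illustration, not the procedure of Arenas et al.

```python
import math


def kuramoto_step(theta, omega, adj, K, dt):
    """One explicit Euler step of Eq. 59, each oscillator coupled to its neighbours only."""
    return [
        theta[i] + dt * (omega[i] + K * sum(math.sin(theta[j] - theta[i]) for j in adj[i]))
        for i in range(len(theta))
    ]


def local_order_parameter(theta_runs, i, j):
    """Eq. 60: cos(theta_i - theta_j) averaged over runs with different initial conditions.

    `theta_runs` is a list of phase lists, one per run, all taken at the same time t.
    """
    return sum(math.cos(th[i] - th[j]) for th in theta_runs) / len(theta_runs)


def dynamic_connectivity(theta_runs, threshold):
    """Binary matrix obtained by thresholding the local order parameter."""
    n = len(theta_runs[0])
    return [[1 if local_order_parameter(theta_runs, i, j) > threshold else 0
             for j in range(n)] for i in range(n)]


# Two coupled oscillators with equal natural frequencies: the phase gap shrinks.
pair = {0: [1], 1: [0]}
after = kuramoto_step([0.0, 1.0], [0.0, 0.0], pair, K=1.0, dt=0.1)

# A single snapshot in which oscillators 0 and 1 are in phase and 2 is in antiphase.
D = dynamic_connectivity([[0.0, 0.0, math.pi]], threshold=0.5)
print(D)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

Iterating `kuramoto_step` from several random initial phases and counting the connected components of the thresholded matrix at successive times gives the kind of curve shown in Fig. 19.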

Based on the same principle, Boccaletti et al. designed
a community detection method based on synchronization
(Boccaletti et al., 2007). The synchronization
dynamics is a variation of Kuramoto's model, the opinion
changing rate (OCR) model (Pluchino et al., 2005).
Here the interaction coupling between adjacent vertices
is weighted by a term proportional to a (negative) power
of the betweenness of the edge connecting the vertices
(Section V.A), with exponent $\alpha$. The evolution equations
of the model are solved by decreasing the value of
$\alpha$ during the evolution of the dynamics, starting from a
configuration in which the system is fully synchronized
($\alpha = 0$). The graph tends to get split into clusters of
synchronized elements, because the interaction strengths
across inter-cluster edges get suppressed due to their high
betweenness scores. By varying $\alpha$, different partitions are
recovered, from the graph as a whole down to the vertices
as separate communities: the partition with the largest
value of modularity is taken as the most relevant. The
algorithm scales in a time $O(mn)$, or $O(n^2)$ on sparse
graphs, and gives good results in practical examples, including
Zachary's karate club (Zachary, 1977) and the
benchmark by Girvan and Newman (Girvan and Newman,
2002) (Section XV.A). The method can be refined
by homogenizing the natural frequencies of the oscillators
during the evolution of the system. In this way, the
system becomes more stable and partitions with higher
modularity values can be recovered.

FIG. 19 Synchronization of Kuramoto oscillators on graphs with two hierarchical levels of communities. (Top) The number
of different synchronized components is plotted versus time for two graphs with different densities of edges within the clusters.
(Bottom) The rank index of the eigenvalues of the Laplacian matrices of the same two graphs of the upper panels is plotted
versus the inverse eigenvalues (the ranking goes from the largest to the smallest eigenvalue). The two types of communities are
revealed by the plateaus. Reprinted figure with permission from Ref. (Arenas et al., 2006). ©2006 by the American Physical
Society.

In a recent paper by Li et al. (Li et al., 2008a), it was
shown that synchronized clusters in modular networks
are characterized by interfacial vertices, whose oscillation 
frequency is intermediate between those of two or more 
clusters, so that they do not belong to a specific commu- 
nity. Li et al. used this result to devise a technique able 
to detect overlapping communities. 

Synchronization-based algorithms may not be reliable 
when communities are very different in size; tests in this 
direction are still missing. 



IX. METHODS BASED ON STATISTICAL INFERENCE 

Statistical inference (Mackay, 2003) aims at deducing 
properties of data sets, starting from a set of observations
and model hypotheses. If the data set is a graph,
the model, based on hypotheses on how vertices are con- 
nected to each other, has to fit the actual graph topol- 
ogy. In this section we review those clustering tech- 
niques attempting to find the best fit of a model to the 
graph, where the model assumes that vertices have some 
sort of classification, based on their connectivity pat- 
terns. We mainly focus on methods adopting Bayesian 
inference (Winkler, 2003), in which the best fit is obtained
through the maximization of a likelihood (generative
models), but we also discuss related techniques,
based on blockmodeling (Doreian et al., 2005), model selection
(Burnham and Anderson, 2002) and information
theory (Mackay, 2003). 

A. Generative models 

Bayesian inference uses observations to estimate the 
probability that a given hypothesis is true. It con- 
sists of two ingredients: the evidence, expressed by 
the information D one has about the system (e.g., 
through measurements); a statistical model with parameters
$\{\theta\}$. Bayesian inference starts by writing the likelihood
$P(D|\{\theta\})$ that the observed evidence is produced
by the model for a given set of parameters $\{\theta\}$. The aim
is to determine the choice of $\{\theta\}$ that maximizes the posterior
distribution $P(\{\theta\}|D)$ of the parameters given the
model and the evidence. By using Bayes' theorem one
has

$$P(\{\theta\}|D) = \frac{1}{Z} P(D|\{\theta\})\, P(\{\theta\}), \qquad (61)$$

where $P(\{\theta\})$ is the prior distribution of the model parameters
and

$$Z = \int P(D|\{\theta\})\, P(\{\theta\})\, d\theta. \qquad (62)$$

Unfortunately, computing the integral in Eq. 62 is a major challenge.
Moreover, the choice of the prior distribution
$P(\{\theta\})$ is non-obvious. Generative models differ from
each other by the choice of the model and the way they
address these two issues.

Bayesian inference is frequently used in the analysis 
and modeling of real graphs, including social (Handcock 
et al., 2007; Koskinen and Snijders, 2007; Rhodes and
Keefe, 2007) and biological networks (Berg and Lassig, 
2006; Rowicka and Kudlicki, 2004). Graph clustering can 
be considered a specific example of inference problem. 
Here, the evidence is represented by the graph structure 
(adjacency or weight matrix) and there is an additional 
ingredient, represented by the classification of the ver- 
tices in groups, which is a hidden (or missing) informa- 
tion that one wishes to infer along with the parameters 
of the model which is supposed to be responsible for the 
classification. This idea is at the basis of several recent 
papers, which we discuss here. In all these works, one 
essentially maximizes the likelihood $P(D|\{\theta\})$ that the
model is consistent with the observed graph structure,
with different constraints. We specify the set of parameters
$\{\theta\}$ as the triplet $(\{q\}, \{\pi\}, k)$, where $\{q\}$ indicates
the community assignment of the vertices, $\{\pi\}$ the model
parameters, and $k$ the number of clusters. In the following
we shall stick to the notation of the papers, so the
variables above may be indicated by different symbols. 
However, to better show what each method specifically 
does we shall refer to our general notation at the end of 
the section. 

Hastings (Hastings, 2006) chooses as a model of net- 
work with communities the planted partition model (Sec- 
tion XV). In it, n vertices are assigned to q groups: ver- 
tices of the same group are linked with a probability pin, 
while vertices of different groups are linked with a prob- 
ability Pout- Pin > Pout, the model graph has a built-in 
community structure. The vertex classification is indi- 
cated by the set of labels {qi}. The probability that, 
given a graph, the classification {qi} is the right one ac- 
cording to the model is^'* 

p{{q.}) cx {exp[- J2 J^,. - E J'K<ij2]r\ (63) 

{ij} 

where J = log{[pi„(l - Pout)]/[Pout{l ~ Pin)]}, J' = 
log[(l— pi„)/(l— PoMt)] and the first sum runs over nearest 
neighboring vertices. Maximizing p{{qi}) is equivalent 
to minimizing the argument of the exponential, which is 
the Hamiltonian of a Potts model with short- and long- 
range interactions. For pin > Pout, J > and J' < 0, 
so the model is a spin glass with ferromagnetic nearest- 
neighbor interactions and antiferromagnetic long-range 
interactions, similar to the model proposed by Reichardt 
and Bornholdt to generalize Newman-Girvan modularity
(Reichardt and Bornholdt, 2006a) (Section VI.B).
Hastings used belief propagation (Gallager, 1963) to
find the ground state of the spin model. On sparse
graphs, the complexity of the algorithm is expected to
be $O(n \log^{\alpha} n)$, where $\alpha$ needs to be estimated numerically.
In principle one needs to input the parameters $p_{in}$
and $p_{out}$, which are usually unknown in practical applications.
However, it turns out that they can be chosen
rather arbitrarily, and that bad choices can be recognized 
and corrected. 
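
A minimal sketch may help to make the energy function concrete. Assuming the Hamiltonian described above, i. e. a ferromagnetic coupling J on the edges and an antiferromagnetic coupling J' on all pairs of vertices in the same group, the Python function below evaluates the energy of a given labelling. The function name `potts_energy` and the toy graph are illustrative choices, not part of Hastings' work, and the sketch only scores a partition: it does not implement belief propagation.

```python
import math
from itertools import combinations

def potts_energy(n, edges, labels, p_in, p_out):
    """Energy H = -J * (# intra-group edges) - J' * (# intra-group vertex pairs).

    Lower energy means a more likely classification under the planted
    partition model with linking probabilities p_in > p_out.
    """
    J = math.log((p_in * (1 - p_out)) / (p_out * (1 - p_in)))
    J_prime = math.log((1 - p_in) / (1 - p_out))
    intra_edges = sum(1 for i, j in edges if labels[i] == labels[j])
    # A sum over ordered pairs i != j with a factor 1/2 equals a sum over unordered pairs.
    intra_pairs = sum(1 for i, j in combinations(range(n), 2)
                      if labels[i] == labels[j])
    return -J * intra_edges - J_prime * intra_pairs

# Toy graph (illustrative): two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
good = [0, 0, 0, 1, 1, 1]  # one group per triangle
bad = [0, 1, 0, 1, 0, 1]   # groups that cut across the triangles
e_good = potts_energy(6, edges, good, p_in=0.8, p_out=0.1)
e_bad = potts_energy(6, edges, bad, p_in=0.8, p_out=0.1)
```

With these (arbitrary) values of the linking probabilities, the partition into the two triangles has a lower energy than both the mixed partition and the trivial partition with a single group.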

Newman and Leicht (Newman and Leicht, 2007) 
have recently proposed a similar method based on a 
mixture model and the expectation-maximization technique
(Dempster et al., 1977). The method bears some
resemblance to an a posteriori blockmodel previously
introduced by Snijders and Nowicki (Nowicki and Sni- 
jders, 2001; Snijders and Nowicki, 1997). They start from 
a directed graph with n vertices, whose vertices fall into 
c classes. The group of vertex i is indicated by $g_i$, $\pi_r$ is the
fraction of vertices in group r, and $\theta_{ri}$ the probability
that there is a directed edge from vertices of group r to
vertex i. By definition, the sets $\{\pi_r\}$ and $\{\theta_{ri}\}$ satisfy the
normalization conditions $\sum_{r=1}^{c} \pi_r = 1$ and $\sum_{i=1}^{n} \theta_{ri} = 1$.

[Footnote to Eq. 63: The actual likelihood includes an additional factor expressing the
a priori probability of the community sizes. Hastings assumes
that this probability is constant.]

Apart from normalization, the probabilities $\{\theta_{ri}\}$ are assumed
to be independent of each other. The best classification
of the vertices corresponds to the maximum of the
average log-likelihood $\bar{\mathcal{L}}$ that the model, described by the
values of the parameters $\{\pi_r\}$ and $\{\theta_{ri}\}$, fits the adjacency
matrix A of the graph. The expression of the average
log-likelihood $\bar{\mathcal{L}}$ requires the definition of the probability
$q_{ir} = \Pr(g_i = r | A, \pi, \theta)$ that vertex i belongs to group r.
By applying Bayes' theorem the probabilities $\{q_{ir}\}$ can
be computed in terms of the $\{\pi_r\}$ and the $\{\theta_{ri}\}$, as

$$q_{ir} = \frac{\pi_r \prod_j \theta_{rj}^{A_{ij}}}{\sum_s \pi_s \prod_j \theta_{sj}^{A_{ij}}}, \qquad (64)$$

while the maximization of the average log-likelihood $\bar{\mathcal{L}}$,
under the normalization constraints of the model variables
$\{\pi_r\}$ and $\{\theta_{ri}\}$, yields the relations

$$\pi_r = \frac{1}{n}\sum_i q_{ir}, \qquad \theta_{rj} = \frac{\sum_i A_{ij}\, q_{ir}}{\sum_i k_i\, q_{ir}}, \qquad (65)$$

where $k_i$ is the outdegree of vertex i. Equations 64 and 65
are self-consistent, and can be solved by iterating them 
to convergence, starting from a suitable set of initial con- 
ditions. Convergence is fast, so the algorithm could be 
applied to fairly large graphs, with up to about 10^ ver- 
tices. 
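
The self-consistent iteration can be sketched as follows, assuming the usual expectation-maximization updates for this mixture model: the membership probabilities are recomputed through Bayes' theorem from the current parameters, and the group fractions and edge probabilities are then re-estimated from the memberships. The function name, the random initialization and the fixed number of sweeps are assumptions made for illustration, not details taken from Newman and Leicht.

```python
import random

def newman_leicht_em(adj, c, sweeps=50, seed=0):
    """EM sketch for a directed graph; adj[i][j] = 1 for an edge i -> j.

    Returns (q, pi, theta): q[i][r] is the probability that vertex i is in
    group r, pi[r] the fraction of vertices in group r, and theta[r][j] the
    probability that an edge leaving group r points to vertex j.
    """
    rng = random.Random(seed)
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    # Random soft assignment, normalized for each vertex (an arbitrary choice).
    q = []
    for _ in range(n):
        row = [rng.random() + 0.1 for _ in range(c)]
        total = sum(row)
        q.append([x / total for x in row])
    pi, theta = [], []
    for _ in range(sweeps):
        # Parameter update: pi_r = (1/n) sum_i q_ir,
        # theta_rj = sum_i A_ij q_ir / sum_i k_i q_ir (k_i is the outdegree).
        pi = [sum(q[i][r] for i in range(n)) / n for r in range(c)]
        theta = []
        for r in range(c):
            denom = sum(out_deg[i] * q[i][r] for i in range(n))
            theta.append([
                sum(adj[i][j] * q[i][r] for i in range(n)) / denom
                if denom > 0 else 0.0
                for j in range(n)
            ])
        # Membership update: q_ir proportional to pi_r * prod_j theta_rj^A_ij.
        for i in range(n):
            weights = []
            for r in range(c):
                w = pi[r]
                for j in range(n):
                    if adj[i][j]:
                        w *= theta[r][j]
                weights.append(w)
            total = sum(weights)
            q[i] = [w / total for w in weights]
    return q, pi, theta

# Toy directed graph (illustrative): two groups of three vertices, each
# vertex pointing to the other two vertices of its own group.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q, pi, theta = newman_leicht_em(adj, c=2)
```

The result depends on the initial conditions, so in practice one would repeat the iteration from several starting points and keep the solution with the largest likelihood.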

The method, designed for directed graphs, can be eas- 
ily extended to the undirected case, whereas an extension 
to weighted graphs is not straightforward. A nice feature
of the method is that it does not require any preliminary 
indication on what type of structure to look for; the re- 
sulting structure is the most likely classification based on 
the connectivity patterns of the vertices. Therefore, var- 
ious types of structures can be detected, not necessarily 
communities. For instance, multipartite structure could 
be uncovered, or mixed patterns where multipartite subgraphs
coexist with communities, etc. In this respect, it
is more powerful than most methods of community de- 
tection, which are bound to focus only on proper commu- 
nities, i. e. subgraphs with more internal than external 
edges. In addition, since partitions are defined by as- 
signing probability values to the vertices, expressing the 
extent of their membership in a group, it is possible that 
some vertices are not clearly assigned to a group, but to 
more groups, so the method is able to deal with overlap- 
ping communities. The main drawback of the algorithm 
is the fact that one needs to specify the number of groups 
c at the beginning of the calculation, a number that is 
typically unknown for real networks. It is possible to de- 
rive this information self-consistently by maximizing the 
probability that the data are reproduced by partitions 
with a given number of clusters. But this procedure in- 
volves some degree of approximation, and the results are 
often not good. 








FIG. 20 Problem of method by Newman and Leicht. By ap- 
plying the method to the illustrated complete bipartite graph 
(colors indicate the vertex classes) the natural group structure 
c) is not recovered; instead, the most likely classifications are 
a) and b). Reprinted figure with permission from Ref. (Ra- 
masco and Mungan, 2008). ©2008 by the American Physical 
Society. 



In a recent study it has been shown that the method 
by Newman and Leicht enables one to rank vertices 
based on their degree of influence on other vertices, 
which allows one to identify the vertices responsible for the
group structure and its stability (Mungan and Ramasco, 
2008). A very similar technique has also been applied 
by Vazquez (Vazquez, 2008) to the problem of popula- 
tion stratification, where animal populations and their 
attributes are represented as hypergraphs (Section A.1).
Vazquez also suggested an interesting criterion to decide 
the optimal number of clusters, namely picking the num- 
ber c whose solution has the greatest similarity with so- 
lutions obtained at different values of c. The similarity 
between two partitions can be estimated in various ways, 
for instance by computing the normalized mutual information
(Section XV). In a subsequent paper (Vazquez,
2008), Vazquez showed that better results are obtained
if the classification likelihood is maximized by using Variational
Bayes (Beal, 2003; Jordan et al., 1999).

Ramasco and Mungan (Ramasco and Mungan, 2008) 
remarked that the normalization condition on the prob- 
abilities {9ri} implies that each group r must have non- 
zero outdegree and that therefore the method fails to 
detect the intuitive group structure of (directed) bi- 
partite graphs (Fig. 20). To avoid this problem, they 
proposed a modification that consists in introducing
three sets for the edge probabilities $\{\theta_{ri}\}$, relative to
edges going from group r to vertex i (as before), from
i to r and in both directions, respectively. Furthermore,
they used the average entropy of the classification
$S_q = -\left(\sum_{i,r} q_{ir} \ln q_{ir}\right)/n$, where the $q_{ir}$ are the analogs
of the probabilities in Eq. 64, to infer the optimal number
of groups, which the method of Newman and Leicht is
unable to provide. Another technique similar to that by
Newman and Leicht has been designed by Ren et al. (Ren
et al., 2009). The model is based on the group fractions
$\{\pi_r\}$, defined as above, and a set of probabilities $\{\beta_{r,i}\}$,
expressing the relevance of vertex i for group r; the basic
assumption is that the probability that two vertices of 
the same group are connected by an edge is proportional 
to the product of the relevances of the two vertices. In 
this way, there is an explicit relation between group mem- 
bership and edge density, and the method can only de- 
tect community structure. The community assignments 
are recovered through an expectation-maximization pro- 
cedure that closely follows that by Newman and Leicht. 
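
As a small illustration of the entropy used by Ramasco and Mungan, the following sketch computes the average entropy of a soft classification, assuming that the membership probabilities are stored row by row and that terms with zero probability contribute nothing. The function name is hypothetical, and the sketch only evaluates the entropy: the text does not specify how its values at different numbers of groups are compared.

```python
import math

def classification_entropy(q):
    """Average entropy S_q = -(1/n) * sum_{i,r} q_ir * ln(q_ir).

    q[i][r] is the probability that vertex i belongs to group r;
    terms with q_ir = 0 are skipped (0 * ln 0 is taken as 0).
    """
    n = len(q)
    total = sum(x * math.log(x) for row in q for x in row if x > 0)
    return -total / n
```

A classification in which every vertex belongs to exactly one group has zero entropy, whereas a vertex shared equally between two groups contributes ln 2.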

Maximum likelihood estimation has been used by 
Copic et al. to define an axiomatization of the prob- 
lem of graph clustering and its related concepts (Copic 
et al., 2005). The starting point is again the planted 
partition model (Section XV), with probabilities $p_{in}$ and
$p_{out}$. A novelty of the approach is the introduction of
the size matrix S, whose element $S_{ij}$ indicates the maximum
strength of interaction between vertices i and j.
For instance, in a graph with unweighted connections, 
all elements of S equal 1. In this case, the probabil- 
ity that the graph conceals a community structure co- 
incides with the expression (63) by Hastings. Copic et 
al. used this probability as a quality function to define 
rankings between graph partitions (likelihood rankings).
The authors show that the likelihood rankings satisfy a 
number of general properties, which should be satisfied 
by any reasonable ranking. They also propose an algo- 
rithm to find the maximum likelihood partition, by using 
the auxiliary concept of pseudo-community structure, i.
e. a grouping of the graph vertices in which it is specified
which pairs of vertices stay in the same community
and which pairs instead stay in different communities. A 
pseudo-community may not be a community because the 
transitive property is not generally valid, as the focus is 
on pairwise vertex relationships: it may happen that i 
and j are classified in the same group, and that j and 
k are classified in the same group, but that i and k are 
not classified as belonging to the same group. We believe 
that the work by Copic et al. is an important first step 
towards a more rigorous formalization of the problem of 
graph clustering. 

Zanghi et al. (Zanghi et al., 2008) have designed a 
clustering technique that lies somewhat in between the 
method by Hastings and that by Newman and Leicht. 
As in Ref. (Hastings, 2006), they use the planted parti- 
tion model to represent a graph with community struc- 
ture; as in Ref. (Newman and Leicht, 2007), they max- 
imize the classification likelihood using an expectation- 
maximization algorithm (Dempster et al, 1977). The 
algorithm runs for a fixed number of clusters q, like that 
by Newman and Leicht; however, the optimal number 
of clusters can be determined by running the algorithm 
for a range of q-values and selecting the solution that
maximizes the Integrated Classification Likelihood introduced
by Biernacki et al. (Biernacki et al., 2000). The



time complexity of the algorithm is O(n^). 



Hofman and Wiggins have proposed a general Bayesian 
approach to the problem of graph clustering (Hofman 
and Wiggins, 2008). Like Hastings (Hastings, 2006), 
they model a graph with community structure as in the 
planted partition problem (Section XV), in that there 
are two probabilities $\theta_c$ and $\theta_d$ that there is an edge
between vertices of the same or different clusters, respectively.
The unobserved community structure is indicated
by the set of labels $\vec{\sigma}$ for the vertices; $\pi_r$ is
again the fraction of vertices in group r. The conjugate
prior distributions $p(\vec{\theta})$ and $p(\vec{\pi})$ are chosen to
be Beta and Dirichlet distributions. The most probable
number of clusters $K^*$ maximizes the conditional
probability $p(K|A)$ that there are K clusters, given the
matrix A. Like Hastings, Hofman and Wiggins assume
that the prior probability $p(K)$ on the number
of clusters is a smooth function, therefore maximizing
$p(K|A)$ amounts to maximizing the Bayesian evidence
$p(A|K) \propto p(K|A)/p(K)$, obtained by integrating the
joint distribution $p(A, \vec{\sigma}|\vec{\pi}, \vec{\theta}, K)$, which is factorizable,
over the model parameters $\vec{\theta}$ and $\vec{\pi}$. The integration
can be performed exactly only for small graphs. Hofman
and Wiggins used Variational Bayes (Beal, 2003; Jordan
et al., 1999), in order to compute controlled approximations
of $p(A|K)$. The complexity of the algorithm was estimated
numerically on synthetic graphs, yielding $O(n^{\alpha})$,
with $\alpha = 1.44$. In fact, the main limitation comes from
high memory requirements. The method is more power- 
ful than the one by Hastings (Hastings, 2006), in that the 
edge probabilities are inferred by the procedure itself 
and need not be specified (or guessed) at the beginning. 
It also includes the expectation-maximization approach 
by Newman and Leicht (Newman and Leicht, 2007) as a 
special case, with the big advantage that the number of 
clusters need not be given as an input, but is an output of 
the method. The software of the algorithm can be found 
at http://www.columbia.edu/~chw2/.



We conclude with a brief summary of the main tech- 
niques described above, coming back to our notation 
at the beginning of the section. In the method by
Hastings, one maximizes the likelihood $P(D|\{q\}, \{\pi\}, k)$
over the set of all possible community assignments
$\{q\}$, given the number of clusters k and the model
parameters (i. e. the linking probabilities $p_{in}$ and
$p_{out}$). Newman and Leicht maximize the likelihood
$P(D|\{q\}, \{\pi\}, k)$ for a given number of clusters, over the
possible choices for the model parameters and community
assignments, by deriving the optimal choices for
both variables with a self-consistent procedure. Hofman
and Wiggins maximize the likelihood
$P_{HW}(k) = \sum_{\{q\}} \int P(D|\{q\}, \{\pi\}, k)\, P(\{q\}|\{\pi\})\, P(\{\pi\})\, d\pi$ over the
possible choices for the number of clusters.

B. Blockmodeling, model selection and Information theory

Block modeling is a common approach in statistics and 
social network analysis to decompose a graph in classes 
of vertices with common properties. In this way, a simpler
description of the graph is attained. Vertices are
usually grouped in classes of equivalence. There are two
main definitions of topological equivalence for vertices:
structural equivalence (Lorrain and White, 1971) (Section
III.B.4), in which vertices are equivalent if they have
the same neighbors; regular equivalence (Everett and
Borgatti, 1994; White and Reitz, 1983), in which vertices
of a class have similar connection patterns to vertices of
the other classes (e.g. parents/children). Regular equivalence
does not require that ties/edges are restricted to
specific target vertices, so it is a more general concept 
than structural equivalence. Indeed, vertices which are 
structurally equivalent are also regularly equivalent, but 
the inverse is not true. The concept of structural equiva- 
lence can be generalized to probabilistic models, in which 
one compares classes of graphs, not single graphs, charac- 
terized by a set of linking probabilities between the ver- 
tices. In this case, vertices are organized in classes such 
that the linking probabilities of a vertex with all other 
vertices of the graph are the same for vertices in the same 
class, which are called stochastically equivalent (Fienberg
and Wasserman, 1981; Holland et al., 1983).
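
The strict notion of structural equivalence is easy to state operationally. The sketch below, with a hypothetical function name and a toy star graph, groups the vertices of an undirected graph that have exactly the same set of neighbors. Under this literal reading two adjacent vertices are never grouped together, because each belongs to the neighborhood of the other but not to its own.

```python
from collections import defaultdict

def structural_equivalence_classes(neighbors):
    """Group vertices that have exactly the same set of neighbors.

    neighbors maps each vertex to the set of its adjacent vertices.
    Returns the classes as a sorted list of sorted vertex lists.
    """
    classes = defaultdict(list)
    for vertex, nbrs in neighbors.items():
        classes[frozenset(nbrs)].append(vertex)
    return sorted(sorted(group) for group in classes.values())

# Toy star graph (illustrative): vertex 0 is the hub, 1-3 are the leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
classes = structural_equivalence_classes(star)
```

In the star graph the three leaves form one class and the hub forms a class of its own.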

A thorough discussion of blockmodeling is beyond the 
scope of this review: we point the reader to Ref. (Doreian
et al., 2005). Here we discuss a recent work by Reichardt
and White (Reichardt and White, 2007). Suppose we
have a directed graph with n vertices and m edges. A
classification of the graph is indicated by the set of labels
$\{\sigma\}$, where $\sigma_i = 1, 2, ..., q$ is the class of vertex i. The
corresponding blockmodel, or image graph, is expressed
by a $q \times q$ adjacency matrix B: $B_{q_1 q_2} = 1$ if edges between
classes $q_1$ and $q_2$ are allowed, otherwise it is zero. The
aim is finding the classification $\{\sigma\}$ and the matrix B
that best fit the adjacency matrix A of the graph. The
goodness of the fit is expressed by the quality function

$$Q^B(\{\sigma\}) = \frac{1}{m} \sum_{i \neq j} \left[a_{ij} A_{ij} B_{\sigma_i \sigma_j} + b_{ij}(1 - A_{ij})(1 - B_{\sigma_i \sigma_j})\right], \qquad (66)$$

where $a_{ij}$ ($b_{ij}$) reward the presence (absence) of edges
between vertices if there are edges (non-edges) between
the corresponding classes, and m is the number of edges
of the graph, as usual. Up to a term that does not depend
on B, Eq. 66 can be rewritten as a sum over the classes

$$Q^B(\{\sigma\}) = \sum_{r,s}^{q} (e_{rs} - [e_{rs}]) B_{rs}, \qquad (67)$$

by setting $e_{rs} = (1/m) \sum_{i \neq j} (a_{ij} + b_{ij}) A_{ij} \delta_{\sigma_i r} \delta_{\sigma_j s}$ and
$[e_{rs}] = (1/m) \sum_{i \neq j} b_{ij} \delta_{\sigma_i r} \delta_{\sigma_j s}$.

[Footnote on structural equivalence: More generally, if they have the same ties/edges to the same
vertices, as in a social network there may be different types of
ties/edges.]

If one sets $a_{ij} = 1 - p_{ij}$
and $b_{ij} = p_{ij}$, $p_{ij}$ can be interpreted as the linking probability
between i and j, in some null model. Then
$e_{rs}$ becomes the number of edges (in units of m) running
between vertices of class r and s, and $[e_{rs}]$ the expected number
of edges in the null model. Reichardt and White set
$p_{ij} = k_i^{out} k_j^{in}/m$, which defines the same null model
as that of Newman-Girvan modularity for directed graphs (Section
VI.B). In fact, if the image graph has only self-edges,
i. e. $B_{rs} = \delta_{rs}$, the quality function $Q^B(\{\sigma\})$
exactly matches modularity. Other choices for the image
graph are possible, however. For instance, a matrix
$B_{rs} = 1 - \delta_{rs}$ describes the classes of a q-partite graph
(Section A.1). From Eq. 67 we see that, for a given classification
$\{\sigma\}$, the image graph that yields the largest
value of the quality function $Q^B(\{\sigma\})$ is that in which
$B_{rs} = 1$ when the term $e_{rs} - [e_{rs}]$ is non-negative, and
$B_{rs} = 0$ when the term $e_{rs} - [e_{rs}]$ is non-positive. So,
the best classification is the one maximizing the quality
function

$$Q^*(\{\sigma\}) = \frac{1}{2} \sum_{r,s}^{q} \| e_{rs} - [e_{rs}] \|, \qquad (68)$$

where all terms of the sum are taken in absolute value.
The function $Q^*(\{\sigma\})$ is maximized via simulated annealing.
The absolute maximum $Q_{max}$ is obtained by
construction when q matches the number $q^*$ of structural
equivalence classes of the graph. However, the absolute
maximum $Q_{max}$ does not have a meaning by itself, as
one can achieve fairly high values of $Q^*(\{\sigma\})$ also for
null model instances of the original graph, i. e. if one
randomizes the graph by keeping the same expected indegree
and outdegree sequences. In practical applications,
the optimal number of classes is determined by
comparing the ratio $Q^*(q)/Q_{max}$ [$Q^*(q)$ is the maximum
of $Q^*(\{\sigma\})$ for q classes] with the expected ratio for the
null model. Since classifications for different q-values are
not hierarchically ordered, overlaps between classes may
be detected. The method can be trivially extended to
the case of weighted graphs.
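
The quality function of a given classification can be evaluated directly, as in the following sketch. It assumes that the observed term is the fraction of edges running from class r to class s, that the null-model term uses linking probabilities proportional to the product of the outdegree of the source and the indegree of the target divided by m, and that the quality is half the sum of the absolute differences between the two. The function name and the toy graph are illustrative, and the simulated annealing search over classifications is not included.

```python
def blockmodel_quality(adj, sigma, q):
    """Q* = (1/2) * sum_{r,s} |e_rs - [e_rs]| for a directed graph.

    adj[i][j] = 1 for an edge i -> j; sigma[i] in range(q) is the class of i.
    e_rs is the fraction of edges from class r to class s, and [e_rs] its
    expectation for the null model p_ij = k_i^out * k_j^in / m.
    """
    n = len(adj)
    m = sum(sum(row) for row in adj)
    k_out = [sum(adj[i]) for i in range(n)]
    k_in = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    e = [[0.0] * q for _ in range(q)]
    e_null = [[0.0] * q for _ in range(q)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r, s = sigma[i], sigma[j]
            e[r][s] += adj[i][j] / m
            e_null[r][s] += k_out[i] * k_in[j] / (m * m)
    return 0.5 * sum(abs(e[r][s] - e_null[r][s])
                     for r in range(q) for s in range(q))

# Toy graph (illustrative): two triangles joined by one edge, with every
# undirected edge stored as two directed edges.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1
q_natural = blockmodel_quality(adj, [0, 0, 0, 1, 1, 1], 2)
q_mixed = blockmodel_quality(adj, [0, 1, 0, 1, 0, 1], 2)
```

On this toy graph the classification into the two triangles scores higher than a classification that mixes them.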

Model selection (Burnham and Anderson, 2002) aims 
at finding models which are at the same time simple and 
good at describing a system/process. A basic example of 
a model selection problem is curve fitting. There is no 
clear-cut recipe to select a model, but a number of heuristics,
like Akaike Information Criterion (AIC) (Akaike,
1974), Bayesian Information Criterion (BIC) (Schwarz,
1978), Minimum Description Length (MDL) (Grünwald
et al., 2005; Rissanen, 1978), Minimum Message Length
(MML) (Wallace and Boulton, 1968), etc.

The modular structure of a graph can be considered 
as a compressed description of the graph to approximate 
the whole information contained in its adjacency matrix. 
Based on this idea, Rosvall and Bergstrom (Rosvall and 
Bergstrom, 2007) envisioned a communication process in 
which a partition of a graph in communities represents a
synthesis Y of the full structure that a signaler sends to
a receiver, who tries to infer the original graph topology
X from it (Fig. 21). The same idea is at the basis of an
earlier method by Sun et al. (Sun et al., 2007), which was
originally designed for bipartite graphs evolving in time
and will be described in Section XIII. The best partition
corresponds to the signal Y that contains the most information
about X. This can be quantitatively assessed by
the minimization of the conditional information $H(X|Y)$
of X given Y,

$$H(X|Y) = \log\left[\prod_{i=1}^{q} \binom{n_i(n_i - 1)/2}{l_{ii}} \prod_{i>j} \binom{n_i n_j}{l_{ij}}\right], \qquad (69)$$

where q is the number of clusters, $n_i$ the number of vertices
in cluster i, $l_{ij}$ the number of edges between clusters
i and j. We remark that, if one imposes no constraints
on q, $H(X|Y)$ is minimal in the trivial case in which
X = Y ($H(X|X) = 0$). This solution is not acceptable
because it does not correspond to a compression of information
with respect to the original data set. One has to
look for the ideal tradeoff between a good compression
and a small enough information $H(X|Y)$. The Minimum
Description Length (MDL) principle (Grünwald et al.,
2005; Rissanen, 1978) provides a solution to this problem,
which amounts to the minimization of a function
given by $H(X|Y)$ plus a function of the number n of
vertices, m of edges and q of clusters. The optimiza- 
tion is performed by simulated annealing, so the method 
is rather slow and can be applied to graphs with up to 
about 10** vertices. However, faster techniques may in 
principle be used, even if they imply a loss in accuracy. 
The method appears superior to modularity optimization,
especially when communities are of different sizes.
This comes from tests performed on the benchmark of
Girvan and Newman (Girvan and Newman, 2002) (Section
XV.A), both in its original version and in asymmetric
versions, proposed by the authors, where the clusters
have different sizes or different average degrees. In ad- 
dition, it can detect other types of vertex classifications 
than communities, as in Eq. 69 there are no constraints 
on the relative importance of the edge densities within 
communities with respect to the edge densities between 
communities. The software of the algorithm can be found 
at http://www.tp.umu.se/~rosvall/code.html.
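
The conditional information of a partition can be computed by counting, for each cluster and each pair of clusters, the number of ways in which the observed edges could be placed. The sketch below assumes an undirected simple graph, a symmetric matrix of edge counts between clusters, and logarithms in base 2, so that the result is expressed in bits; the base and the function name are assumptions made for illustration.

```python
import math

def conditional_information(sizes, links):
    """H(X|Y) in bits for a partition of an undirected simple graph.

    sizes[i] is the number of vertices in cluster i; links[i][j] is the
    number of edges between clusters i and j (links[i][i] counts the
    edges inside cluster i).
    """
    h = 0.0
    for i in range(len(sizes)):
        # Ways to place the internal edges among the n_i (n_i - 1) / 2 pairs.
        pairs_inside = sizes[i] * (sizes[i] - 1) // 2
        h += math.log2(math.comb(pairs_inside, links[i][i]))
        for j in range(i):
            # Ways to place the edges between clusters among n_i * n_j pairs.
            h += math.log2(math.comb(sizes[i] * sizes[j], links[i][j]))
    return h

# Two triangles joined by one edge, described by two clusters (illustrative).
h_two = conditional_information([3, 3], [[3, 1], [1, 3]])
# A path 0-1-2 in which every vertex is its own cluster, i.e. Y = X.
h_trivial = conditional_information([1, 1, 1],
                                    [[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```

In the first example the only uncertainty left is which of the nine pairs of vertices carries the edge between the two triangles; in the second example nothing is left to infer and the conditional information vanishes.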

In a recent paper (Rosvall and Bergstrom, 2008), Ros- 
vall and Bergstrom pursued the same idea of describing 
a graph by using less information than that encoded in 
the full adjacency matrix. The goal is to optimally com- 
press the information needed to describe the process of 
information diffusion across the graph. Random walk 
is chosen as a proxy of information diffusion. A two- 
level description, in which one gives unique names to im- 
portant structures of the graph and to vertices within 
the same structure, but the vertex names are recycled 
among different structures, leads to a more compact de- 
scription than by simply coding all vertices with different 
names. This is similar to the procedure usually adopted 



in geographic maps, where the structures are cities and 
one usually chooses the same names for streets of dif- 
ferent cities, as long as there is only one street with a 
given name in the same city. Huffman coding (Huffman, 
1952) is used to name vertices. For the random walk, the 
above-mentioned structures are communities, as it is in- 
tuitive that walkers will spend a lot of time within them, 
so they play a crucial role in the process of information 
diffusion. Graph clustering turns then into the follow- 
ing coding problem: finding the partition that yields the 
minimum description length of an infinite random walk. 
Such description length consists of two terms, expressing 
the Shannon entropy of the random walk within and be- 
tween clusters. Every time the walker steps to a different 
cluster, one needs to use the codeword of that cluster 
in the description, to inform the decoder of the transition.
Clearly, if clusters are well separated from each
other, transitions of the random walker between clusters
will be infrequent, so it is advantageous to use the map,
with the clusters as regions, because in the description 
of the random walk the codewords of the clusters will 
not be repeated many times, while there is a consider- 
able saving in the description due to the limited length 
of the codewords used to denote the vertices. Instead, if 
there are no well-defined clusters and/or if the partition 
is not representative of the actual community structure 
of the graph, transitions between the clusters of the par- 
tition will be very frequent and there will be little or 
no gain by using the two-level description of the map. 
The minimization of the description length is carried out 
by combining greedy search with simulated annealing. 
In a subsequent paper (Rosvall et al., 2009), the authors
adopted the fast greedy technique designed by Blondel 
et al. for modularity optimization (Blondel et al., 2008), 
with some refinements. The method can be applied to 
weighted graphs, both undirected and directed. In the 
latter case, the random walk process is modified by intro- 
ducing a teleportation probability r, to guarantee ergod- 
icity, just like in Google's PageRank algorithm (Brin and 
Page, 1998). The partitions of directed graphs obtained 
by the method differ from those derived by optimizing 
the directed version of Newman-Girvan modularity (Section
VI.B): this is due to the fact that modularity focuses
on pairwise relationships between vertices, so it does not
capture flows. The code of the method is available at
http://www.tp.umu.se/~rosvall/code.html.
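
The two-level description length can be illustrated on a small undirected graph. The explicit formula is not given in the text: the sketch below assumes the form known as the map equation, in which the length is the sum of a term for the transitions between clusters and of one term per cluster for the movements within it, each weighted by how often it is used. It further assumes an unweighted undirected graph, for which the random walker visits a vertex with probability proportional to its degree and leaves a cluster along each of its outgoing edges with probability 1/(2m). All names are illustrative, and no optimization over partitions is attempted.

```python
import math

def _plogp(p):
    """Return p * log2(p), with the convention 0 * log2(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def description_length(edges, module):
    """Two-level description length (bits per step) of a random walk.

    edges is a list of undirected edges (u, v); module maps each vertex
    to the identifier of its cluster.
    """
    two_m = 2 * len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    visit = {v: k / two_m for v, k in degree.items()}
    # Probability per step of leaving each cluster.
    exit_p = {}
    for u, v in edges:
        if module[u] != module[v]:
            exit_p[module[u]] = exit_p.get(module[u], 0.0) + 1 / two_m
            exit_p[module[v]] = exit_p.get(module[v], 0.0) + 1 / two_m
    q_total = sum(exit_p.values())
    length = 0.0
    if q_total > 0:
        # Entropy of the cluster names, weighted by the transition rate.
        length += -q_total * sum(_plogp(q / q_total) for q in exit_p.values())
    for mod in set(module.values()):
        q_i = exit_p.get(mod, 0.0)
        members = [visit[v] for v in visit if module[v] == mod]
        p_i = q_i + sum(members)
        if p_i == 0:
            continue
        # Entropy of the vertex names (plus the exit code) inside the cluster.
        h_i = -_plogp(q_i / p_i) - sum(_plogp(p / p_i) for p in members)
        length += p_i * h_i
    return length

# Two triangles joined by one edge (illustrative).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
one_level = description_length(edges, {v: 0 for v in range(6)})
two_level = description_length(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
```

For this graph the two-level description with one cluster per triangle is shorter than the one-level description, in line with the argument that well separated clusters make the map worth using.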

[Footnote on the two-level description: Instead, for a one-level description, in which all vertices have
different names, it is enough to specify the codeword of the vertex
reached at every step to completely define the process, but this
may be costly.]

FIG. 21 Basic principle of the method by Rosvall and Bergstrom (Rosvall and Bergstrom, 2007). An encoder sends to a
decoder compressed information about the topology of the graph on the left. The information gives a coarse description
of the graph, which is used by the decoder to deduce the original graph structure. Reprinted figure with permission from
Ref. (Rosvall and Bergstrom, 2007). ©2007 by the National Academy of Science of the USA.

Chakrabarti (Chakrabarti, 2004) has applied the MDL
principle to put the adjacency matrix of a graph into
the (approximately) block diagonal form representing the
best tradeoff between having a limited number of blocks,
for a good compression of the graph topology, and having
very homogeneous blocks, for a compact description
of their structure. The total encoding cost T includes
the information on the total number of vertices of the 
graph, on the number of blocks and the number of ver- 
tices and edges in each block, along with the adjacency 
matrices of the blocks. The minimization of T is carried 
out by starting from the partition in which the graph is 
a single cluster. At each step, one operates a bipartition 
of the cluster of the partition with the maximum Shan- 
non entropy per vertex. The split is carried out in order 
to remove from the original cluster those vertices carry- 
ing the highest contribution to the entropy per vertex of 
the cluster. Then, starting from the resulting partition, 
which has one more cluster than the previous one, T is 
optimized among those partitions with the same number 
of clusters. The procedure continues until one reaches 
a number of clusters k*, for which T cannot be further 
decreased. The method by Chakrabarti has complexity 
0[/(fc*)^rn], where / is the number of iterations required 
for the convergence of the optimization for a given num- 
ber of clusters, which is usually small (/ < 20 in the 
experiments performed by the author). Therefore the 
algorithm can be applied to fairly large graphs. 

Information theory has also been used to detect communities
in graphs. Ziv et al. (Ziv et al., 2005) have designed
a method in which the information contained in
the graph topology is compressed so as to preserve some
predefined information. This is the basic principle of the
information bottleneck method (Tishby et al., 1999). To
understand this criterion, we need to introduce an important
measure, the mutual information $I(X, Y)$ (Mackay,
2003) of two random variables X and Y. It is defined as

$$I(X, Y) = \sum_x \sum_y P(x, y) \log \frac{P(x, y)}{P(x) P(y)}, \qquad (70)$$



where $P(x)$ indicates the probability that X = x (similarly
for $P(y)$) and $P(x, y)$ is the joint probability of X
and Y, i. e. $P(x, y) = P(X = x, Y = y)$. The measure
$I(X, Y)$ tells how much we learn about X if we know
Y, and vice versa. If X is the input variable, Z the variable
specifying the partition and Y the variable encoding
the information we want to keep, which is called relevant
variable, the goal is to minimize the mutual information
between X and Z (to achieve the largest possible data
compression), under the constraint that the information
on Y extractable from Z be accurate. The optimal tradeoff
between the values of $I(X, Z)$ and $I(Y, Z)$ (i. e. compression
versus accuracy) is expressed by the minimization
of a functional, where the relative weight of the two
contributions is given by a parameter playing the role of a 
temperature. In the case of graph clustering, the question 
is what to choose as relevant information variable. Ziv 
et al. proposed to adopt the structural information en- 
coded in the process of diffusion on the graph. They also 
introduce the concept of network modularity, which char- 
acterizes the graph as a whole, not a specific partition like 
the modularity by Newman and Girvan (Section III.C.2). 
The network modularity is defined as the area under the 
information curve, which essentially represents the rela- 
tion between the extent of compression and accuracy for 
all solutions found by the method and all possible num- 
bers of clusters. The software of the algorithm by Ziv et 
al. can be found at http://www.columbia.edu/~chw2/.
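
The mutual information between two discrete variables can be computed directly from their joint distribution, as in the sketch below. The function name is illustrative and the logarithm is taken in base 2, which is an assumption: the choice of the base only fixes the units.

```python
import math

def mutual_information(joint):
    """Mutual information I(X, Y) in bits.

    joint[x][y] is the joint probability P(X = x, Y = y); the marginal
    distributions are obtained by summing over rows and columns.
    """
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, p_xy in enumerate(row):
            if p_xy > 0:
                total += p_xy * math.log2(p_xy / (p_x[x] * p_y[y]))
    return total

independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
identical = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

Two independent binary variables share no information, while two identical fair binary variables share exactly one bit.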



X. ALTERNATIVE METHODS 

In this section we describe some algorithms that do 
not fit in the previous categories, although some overlap
is possible.

Raghavan et al. (Raghavan et al., 2007) have designed
a simple and fast method based on label propagation. 
Vertices are initially given unique labels (e.g. their ver- 
tex labels). At each iteration, a sweep over all vertices, 
in random sequential order, is performed: each vertex 
takes the label shared by the majority of its neighbors. 
If there is no unique majority, one of the majority labels 
is picked at random. In this way, labels propagate across 
the graph: most labels will disappear, others will domi- 
nate. The process reaches convergence when each vertex 
has the majority label of its neighbors. Communities 
are defined as groups of vertices having identical labels 
at convergence. By construction, each vertex has more 
neighbors in its community than in any other commu- 
nity. This resembles the strong definition of community 
we have discussed in Section III.B.2, although the latter 
is stricter, in that each vertex must have more neighbors 
in its community than in the rest of the graph. The al- 
gorithm does not deliver a unique solution. Due to the 
many ties encountered along the process it is possible 
to derive different partitions starting from the same ini- 
tial condition, with different random seeds. Tests on real 
graphs show that all partitions found are similar to each 
other, though. The most precise information that one can 
extract from the method is contained by aggregating the 
various partitions obtained, which can be done in various 
ways. The authors proposed to label each vertex with the 
set of all labels it has in different partitions. Aggregat- 
ing partitions enables one to detect possible overlapping 
communities. The main advantage of the method is the 
fact that it does not need any information on the num- 
ber and the size of the clusters. It does not need any 
parameter, either. The time complexity of each iteration
of the algorithm is $O(m)$, and the number of iterations
to convergence appears independent of the graph size, or 
growing very slowly with it. So the technique is really 
fast and could be used for the analysis of large systems. 
In a recent paper (Tibely and Kertesz, 2008), Tibely and
Kertesz showed that the method is equivalent to finding 
the local energy minima of a simple zero-temperature ki- 
netic Potts model, and that the number of such energy 
minima is considerably larger than the number of vertices 
of the graph. Aggregating partitions as Raghavan et al. 
suggest leads to a fragmentation of the resulting partition 
in clusters that are the smaller, the larger the number of 
aggregated partitions. This is potentially a serious prob- 
lem of the algorithm by Raghavan et al., especially when 
large graphs are investigated. In order to eliminate unde- 
sired solutions, Barber and Clark introduced some con- 
straints in the optimization process (Barber and Clark, 
2009). This amounts to adding some terms to the objective
function H whose maximization is equivalent to
the original label propagation algorithm. Interestingly,



if one imposes the constraint that partitions have to be 
balanced, i. e. that clusters have similar total degrees, 
the objective function becomes formally equivalent to 
Newman-Girvan modularity Q (Section III.C.2), so the 
corresponding version of the label propagation algorithm 
is essentially based on a local optimization of modularity. 
Leung et al. have found that the original algorithm by 
Raghavan et al., applied on online social networks, often 
yields partitions with one giant community together with 
much smaller ones (Leung et al., 2009). In order to avoid
this disturbing feature, which is an artefact of the algo- 
rithm, Leung et al. proposed to modify the method by 
introducing a score for the labels, which decreases as the 
label propagates far from the vertex to which the label 
was originally assigned. When choosing the label of a 
vertex, the labels of its neighbors are weighted by their 
scores, so a single label cannot span too large a portion
of the graph (as its weight fades with the distance
from the origin), and no giant communities can be
recovered. Tests of the modified algorithm on the LFR
benchmark (Lancichinetti et al., 2008) (Section XII.A)
give good results and encourage further investigations. 
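
The basic update rule of label propagation can be sketched as follows. This is a minimal illustration and not the implementation of Raghavan et al.: the function names are ours, the graph is assumed to be given as a dictionary mapping each vertex to the set of its neighbours, and we assume the convention that a vertex keeps its current label whenever that label is among the most frequent ones in its neighbourhood, so that random choices are only needed to break the remaining ties.

```python
import random
from collections import Counter


def most_frequent_labels(v, adj, labels):
    """Labels that occur most often among the neighbours of v."""
    counts = Counter(labels[u] for u in adj[v])
    top = max(counts.values())
    return {lab for lab, c in counts.items() if c == top}


def label_propagation(adj, seed=0, max_sweeps=100):
    """Asynchronous label propagation; returns a dict vertex -> label."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}            # each vertex starts with a unique label
    order = [v for v in adj if adj[v]]      # isolated vertices keep their own label
    for _ in range(max_sweeps):
        rng.shuffle(order)                  # new random update order at each sweep
        for v in order:
            best = most_frequent_labels(v, adj, labels)
            if labels[v] not in best:       # assumed convention: keep the label on ties
                labels[v] = rng.choice(sorted(best))
        # stop when every vertex carries one of the most frequent labels
        # among its neighbours
        if all(labels[v] in most_frequent_labels(v, adj, labels) for v in order):
            break
    return labels
```

On graphs with many ties, different values of the hypothetical seed parameter may lead to different partitions, which is the variability discussed in the text.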

Bagrow and Bollt designed an agglomerative technique,
called the L-shell method (Bagrow and Bollt, 2005).
It is a procedure that finds the community of any ver- 
tex, although the authors also presented a more gen- 
eral procedure to identify the full community structure 
of the graph. Communities are defined locally, based 
on a simple criterion involving the number of edges in- 
side and outside a group of vertices. One starts from a 
vertex-origin and keeps adding vertices lying on succes- 
sive shells, where a shell is defined as a set of vertices at 
a fixed geodesic distance from the origin. The first shell 
includes the nearest neighbours of the origin, the second 
the next-to-nearest neighbours, and so on. At each it- 
eration, one calculates the number of edges connecting 
vertices of the new layer to vertices inside and outside 
the running cluster. If the ratio of these two numbers 
("emerging degree") exceeds some predefined threshold, 
the vertices of the new shell are added to the cluster,
otherwise the process stops. The idea of closing a community
by expanding a shell was previously introduced by
Costa (da Fontoura Costa, 2004), in a procedure where shells
are centered on hubs. However, in that procedure the number of
clusters is preassigned and no cluster can contain more 
than one hub. Because of the local nature of the process, 
the L-shell method is very fast and can identify commu- 
nities very quickly. Unfortunately the method works well 
only when the source vertex is approximately equidistant 
from the boundary of its community. To overcome this 
problem, Bagrow and Bollt suggested repeating the process
starting from every vertex and deriving a membership



The objective function is H = (1/2) Σ_ij A_ij δ(l_i, l_j), where l_i is the label of
vertex i, A is the adjacency matrix of the graph and δ is Kronecker's function.
It is just the negative of the energy of a zero-temperature Potts model, as found
by Tibely and Kertesz (Tibely and Kertesz, 2008).

matrix M: the element M_ij is one if vertex j belongs
to the community of vertex i, otherwise it is zero. The 
membership matrix can be rewritten by suitably permuting
rows and columns based on their mutual distances.
The distance between two rows (or columns) is defined as 
the number of entries whose elements differ. If the graph 
has a clear community structure, the membership matrix
takes a block-diagonal form, where the blocks identify
the communities. The method enables one to detect
overlaps between communities as well (Porter et al.,
2007). Unfortunately, the rearrangement of the matrix
requires a time O(n^3), so it is quite slow. A variant
of the algorithm by Bagrow and Bollt, in which bound- 
ary vertices are examined separately and both first and 
second nearest neighbors of the running community are 
simultaneously investigated, was suggested by Rodrigues 
et al. (Rodrigues et al., 2007).

A recent methodology introduced by Papadopoulos et 
al. (Papadopoulos et al., 2009), called Bridge Bounding,
is similar to the L-shell algorithm, but here the clus- 
ter around a vertex grows until one "hits" the boundary 
edges. Such edges can be recognized from the values of 
various measures, like betweenness (Girvan and Newman, 
2002) or the edge clustering coefficient (Radicchi et al.,
2004). The problem is that there are often no clear gaps 
in the distributions of the values of such measures, so 
one is forced to set a threshold to automatically distinguish
the boundary edges from the others, and there is no
obvious way to do it. The best results of the algorithm 
are obtained by using a measure consisting of a weighted 
sum of the edge clustering coefficient over a wider neigh- 
borhood of the given edge. This version of the method 
has a time complexity O(⟨k⟩^2 m + ⟨k⟩ n), where ⟨k⟩ is the
average degree of the graph. 

In another algorithm by Clauset, local communities are 
discovered through greedy maximization of a local mod- 
ularity measure (Clauset, 2005). Given a community C, 
the boundary B is the set of vertices of C with at least one 
neighbor outside C (Fig. 22). The local modularity R by
Clauset is the ratio of the number of edges having both
endpoints in C (but at least one in B) to the number
of edges having at least one endpoint in B. It is a measure
of the sharpness of the community boundary. Its
optimization consists of a local exploration of the com- 
munity starting from a source vertex: at each step the 
neighboring vertex yielding the largest increase (smallest 
decrease) of R is added, until the community has reached 
a predefined size n_c. This greedy optimization takes a
time O(n_c^2 ⟨k⟩), where ⟨k⟩ is the average degree of the
graph. The local modularity R has been used in a paper
by Hui et al. (Hui et al., 2007), where methods to find
communities in networks of mobile devices are designed. 
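
As an illustration of the definition of R and of its greedy optimization, consider the following sketch. It is not Clauset's code: the function names are hypothetical, the graph is assumed to be a dictionary mapping each vertex to the set of its neighbours (no self-loops), ties between candidate vertices are broken by vertex order, and the value R = 1 for a community without boundary is a convention we adopt here.

```python
def local_modularity(adj, community):
    """R: fraction of the edges touching the boundary B that stay inside C."""
    C = set(community)
    boundary = {v for v in C if any(u not in C for u in adj[v])}
    inside = total = 0
    seen = set()
    for v in boundary:
        for u in adj[v]:
            edge = frozenset((u, v))
            if edge in seen:
                continue                    # count each undirected edge once
            seen.add(edge)
            total += 1
            if u in C:
                inside += 1
    # convention of this sketch: R = 1 when no edge leaves C
    return inside / total if total else 1.0


def grow_community(adj, source, size):
    """Greedily add the neighbouring vertex giving the largest R until |C| = size."""
    C = {source}
    while len(C) < size:
        frontier = set().union(*(adj[v] for v in C)) - C
        if not frontier:
            break                           # the whole component has been absorbed
        best = max(sorted(frontier), key=lambda u: local_modularity(adj, C | {u}))
        C.add(best)
    return C
```

On two triangles joined by a single edge, starting from a vertex of one triangle with size 3 recovers that triangle, whose boundary consists of the vertex attached to the bridge.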

Another method, where communities are defined based 
on a local criterion, was presented by Eckmann and 
Moses (Eckmann and Moses, 2002). The idea is to use 
the clustering coefficient (Watts and Strogatz, 1998) of 
a vertex as a quantity to distinguish tightly connected 
groups of vertices. Many edges mean many loops inside 




FIG. 22 Schematic picture of a community C used in the 
definition of local modularity by Clauset (Clauset, 2005). The 
black area indicates the subgraph of C including all vertices of 
C, whose neighbors are also in C. The boundary B entails the 
vertices of C with at least one neighbor outside the community. 
Reprinted figure with permission from Ref. (Clauset, 2005). 
©2005 by the American Physical Society. 



a community, so the vertices of a community are likely 
to have a large clustering coefficient. The latter can be 
related to the average distance between pairs of neigh- 
bours of the vertex. The possible values of the distance 
are 1 (if neighbors are connected) or 2 (if they are not), 
so the average distance lies between 1 and 2. The more 
triangles there are in the subgraph, the shorter the av- 
erage distance. Since each vertex always has distance 1 
from its neighbours, the fact that the average distance 
between its neighbours is different from 1 is reminiscent of what
happens when one measures segments on a curved sur- 
face. Endowed with a metric, represented by the geodesic 
distance between vertices/points, and a curvature, the 
graph can be embedded in a geometric space. Communi- 
ties appear as portions of the graph with a large curva- 
ture. The algorithm was applied to the graph represen- 
tation of the World Wide Web, where vertices are web 
pages and edges are the hyperlinks that take users from
one page to another. The authors found that communities
correspond to web pages dealing with the same topic. 
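
The relation between the clustering coefficient and the average distance between neighbours can be made explicit with a small sketch. If a fraction C of the neighbour pairs of a vertex are connected (distance 1) and the others are not (distance 2, through the vertex itself), the average distance is C + 2(1 - C) = 2 - C. The function names below are ours, and the graph is assumed to be a dictionary mapping each vertex to the set of its neighbours.

```python
from itertools import combinations


def neighbour_pairs(adj, v):
    """All unordered pairs of neighbours of v."""
    pairs = list(combinations(sorted(adj[v]), 2))
    if not pairs:
        raise ValueError("the vertex needs at least two neighbours")
    return pairs


def clustering_coefficient(adj, v):
    """Fraction of neighbour pairs of v that are connected by an edge."""
    pairs = neighbour_pairs(adj, v)
    return sum(1 for a, b in pairs if b in adj[a]) / len(pairs)


def mean_neighbour_distance(adj, v):
    """Average distance between neighbours of v: 1 if linked, 2 otherwise."""
    pairs = neighbour_pairs(adj, v)
    return sum(1 if b in adj[a] else 2 for a, b in pairs) / len(pairs)
```

For any vertex with at least two neighbours the two quantities satisfy mean_neighbour_distance = 2 - clustering_coefficient, so a large clustering coefficient corresponds to an average distance close to 1.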

Long et al. have devised an interesting technique that 
is able to detect various types of vertex groups, not necessarily
communities (Long et al., 2007). The method
is based on graph approximation, as it tries to match 
the original graph topology onto a coarse type of graph, 
the community prototype graph, which has a clear group 
structure (block-diagonal for clusters, block-off-diagonal 
for classes of multipartite graphs, etc.). The goal is to 
determine the community prototype graph that best approximates
the graph under study, where the goodness of the
approximation is expressed by the distance between the 
corresponding matrices. In this way the original prob- 
lem of finding graph subsets becomes an optimization 
problem. Long et al. called this procedure Community 
Learning by Graph Approximation (CLGA). Sometimes 
the minimization of the matrix distance can be turned 
into the maximization of the trace of a matrix. Measures 
like cut size or ratio cut can be also formulated as the 
trace of matrices (see for instance Eq. 18). In fact, CLGA 
includes traditional graph partitioning as a special case 
(Section IV.A). Long et al. designed three algorithms
for CLGA: two of them seek divisions of the graph
into overlapping or non-overlapping groups, respectively; 
in the third one an additional constraint is introduced 
to produce groups of comparable size. The complexity 
of these algorithms is O(tn^2 k), where t is the number
of iterations until the optimization converges and k the 
number of groups. The latter has to be given as an input, 
which is a serious limit of CLGA. 

A fast algorithm by Wu and Huberman identifies com- 
munities based on the properties of resistor networks (Wu 
and Huberman, 2004). It is essentially a method for partitioning
graphs in two parts, similar to spectral bisection,
although partitions in an arbitrary number of communities
can be obtained by iterative applications. The
graph is transformed into a resistor network where each 
edge has unit resistance. A unit potential difference is 
set between two randomly chosen vertices. The idea is 
that, if there is a clear division in two communities of 
the graph, there will be a visible gap between voltage 
values for vertices at the borders between the clusters. 
The voltages are calculated by solving Kirchhoff's equations:
an exact solution would be too time consuming,
but it is possible to find a reasonably good approximation
in linear time for a sparse graph with a clear community
structure, so the most time-consuming part of the
algorithm is the sorting of the voltage values, which takes
time O(n log n). Any possible vertex pair can be chosen
to set the initial potential difference, so the procedure 
should be repeated for all possible vertex pairs. The au- 
thors showed that this is not necessary, and that a limited 
number of sampling pairs is sufficient to get good results, 
so the algorithm scales as O(n log n) and is very fast. An
interesting feature of the method is that it can quickly 
find the natural community of any vertex, without de- 
termining the complete partition of the graph. For that, 
one uses the vertex as source voltage and places the sink 
at an arbitrary vertex. The same feature is present in an 
older algorithm by Flake et al. (Flake et al., 2002), where 
one uses max-flow instead of current flow (Section IV.A).
An algorithm by Orponen and Schaeffer (Orponen and
Schaeffer, 2005) is based on the same principle, but it
does not need the specification of target sources as it is 
based on diffusion in an unbounded medium. The limit 
of such methods is the fact that one has to give as in- 
put the number of clusters, which is usually not known 
beforehand. 
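
The voltage-based bisection can be illustrated with a short sketch. We assume here a connected graph stored as a dictionary mapping each vertex to the set of its neighbours, and we approximate the solution of Kirchhoff's equations by repeatedly replacing the voltage of each vertex with the average voltage of its neighbours, keeping the source at 1 and the sink at 0. Splitting at the largest voltage gap is our simplification for the example; the function names and the number of rounds are hypothetical choices, not those of Wu and Huberman.

```python
def voltages(adj, source, sink, rounds=100):
    """Approximate vertex voltages of a unit-resistance network (connected graph)."""
    V = {v: 0.0 for v in adj}
    V[source] = 1.0
    for _ in range(rounds):
        for v in adj:
            if v == source or v == sink:
                continue                    # the two poles are kept fixed
            # Kirchhoff: the voltage of v is the average voltage of its neighbours
            V[v] = sum(V[u] for u in adj[v]) / len(adj[v])
    return V


def split_by_largest_gap(V):
    """Sort vertices by voltage and cut where consecutive voltages differ most."""
    ranked = sorted(V, key=V.get)
    gaps = [V[ranked[i + 1]] - V[ranked[i]] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return set(ranked[:cut]), set(ranked[cut:])
```

For two triangles joined by one edge, with source and sink placed in different triangles, the largest voltage drop occurs across the bridge, so the two triangles are separated.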



Ohkubo and Tanaka (Ohkubo and Tanaka, 2006) 
pointed out that, since communities are rather compact 
structures, they should have a small volume, where the 
volume of a community is defined as the ratio of the 
number of vertices to the internal edge density of the
community. Ohkubo and Tanaka assumed that the sum
V_total of the volumes of the communities of a partition is
a reliable index of the goodness of the partition. So, the
most relevant partition is the one minimizing V_total. The
optimization is carried out with simulated annealing. 
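
A minimal sketch of the quantity being minimized is given below. We assume that the internal edge density of a community with n_c vertices is the number of its internal edges divided by n_c(n_c - 1)/2, and that a community without internal edges has infinite volume; both are assumptions of this example, as are the function names. The simulated annealing step is not shown.

```python
def community_volume(adj, community):
    """Volume = number of vertices divided by the internal edge density."""
    C = set(community)
    n_c = len(C)
    internal = sum(1 for v in C for u in adj[v] if u in C) // 2
    possible = n_c * (n_c - 1) / 2          # number of vertex pairs inside C
    if internal == 0:
        return float("inf")                 # assumed convention for empty density
    # n_c / (internal / possible), written so as to avoid rounding errors
    return n_c * possible / internal


def total_volume(adj, partition):
    """Sum of the volumes of the communities of a partition."""
    return sum(community_volume(adj, c) for c in partition)
```

For two triangles joined by one edge, the partition into the two triangles has total volume 6, while moving one bridge endpoint into the other group raises it to 8.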

Zarei and Samani (Zarei and Samani, 2009) remarked 
that there is a symmetry between community structure 
and anti-community (multipartite) structure, when one 
considers a graph and its complement, whose edges are 
the missing edges of the original graph. In fact, if a graph 
has well identified communities, the same groups of
vertices would be strong anti-communities in the complement
graph, i. e. they should have few intra-cluster
edges and many inter-cluster edges. Based on
this remark, the communities of a graph can be identified
by looking for anti-communities in the complement
graph, which can sometimes be easier. Zarei and
Samani devised a spectral method using matrices of the 
complement graph. The results of this technique ap- 
pear good as compared to other spectral methods on 
artificial graphs generated with the planted ℓ-partition
model (Condon and Karp, 2001), as well as on Zachary's 
karate club (Zachary, 1977), Lusseau's dolphins' net- 
work (Lusseau, 2003) and a network of protein-protein 
interactions. However, the authors have used very small 
graphs for testing. Communities make sense on sparse 
graphs, but the complements of large sparse graphs would 
not be sparse, but very dense, and their community (mul- 
tipartite) structure basically invisible. 

Gudkov and Montealegre detected communities by 
means of dynamical simplex evolution (Gudkov et al., 
2008). Graph vertices are represented as points in an 
(n - 1)-dimensional space. The points initially sit on
the n vertices of a simplex, and then move in space
due to forces exerted by the other points. If vertices 
are neighbors, the mutual force acting on their repre- 
sentative points is attractive, otherwise it is repulsive. 
If the graph has a clear community structure, the cor- 
responding spatial clusters repel each other because of 
the few connections between them (repulsion dominates 
over attraction). If communities are more mixed with 
each other, clusters are not well separated and they could 
be mistakenly aggregated in larger structures. To avoid 
that, Gudkov and Montealegre defined clusters as groups 
of points such that the distance between each pair of 
points does not exceed a given threshold, which can be 
arbitrarily tuned, to reveal structures at different resolutions
(Section XII.A). The algorithm consists in solving
first-order differential equations, describing the dynam- 
ics of mass points moving in a viscous medium. The 
complexity of the procedure is O(n^2). Differential equations
are also at the basis of a recent method designed by
Krawczyk and Kulakowski (Krawczyk, 2008; Krawczyk
and Kulakowski, 2007). Here the equations describe a
dynamic process, in which the original graph topology 
evolves to a disconnected graph, whose components are 
the clusters of the original graph. 

Despite the significant improvements in computational 
complexity, it is still problematic to apply clustering algorithms
to many large networks available today. Therefore
Narasimhamurthy et al. (Narasimhamurthy et al.,
2008) proposed a two-step procedure: first, the graph
under study is decomposed in smaller pieces by a fast
graph partitioning technique; then, a clustering method 
is applied to each of the smaller subgraphs obtained 
[Narasimhamurthy et al. used the Clique Percolation 
Method (Section XI.A)]. The initial decomposition of the
graph is carried out through the multilevel method by 
Dhillon et al. (Dhillon et al., 2007). It is crucial to verify 
that the initial partitioning does not split the commu- 
nities of the graph among the various subgraphs of the 
decomposition. This can be done by comparing, on arti- 
ficial graphs, the final clusters obtained with the two-step 
method with those detected by applying the chosen clus- 
tering technique to the entire graph. 

XI. METHODS TO FIND OVERLAPPING 
COMMUNITIES 

Most of the methods discussed in the previous sec- 
tions aim at detecting standard partitions, i. e. partitions 
in which each vertex is assigned to a single community. 
However, in real graphs vertices are often shared between 
communities (Section II), and the issue of detecting over- 
lapping communities has become quite popular in the last 
few years. We devote this section to the main techniques 
to detect overlapping communities. 

A. Clique percolation 

The most popular technique is the Clique Percolation 
Method (CPM) by Palla et al. (Palla et al., 2005). It is 
based on the concept that the internal edges of a com- 
munity are likely to form cliques due to their high den- 
sity. On the other hand, it is unlikely that intercom- 
munity edges form cliques: this idea was already used 
in the divisive method of Radicchi et al. (Section V.B). 
Palla et al. use the term k-clique to indicate a complete
graph with k vertices. Notice that a k-clique is
different from the n-clique (see Section III.B.2) used in 
social science. If it were possible for a clique to move 
on a graph, in some way, it would probably get trapped 
inside its original community, as it could not cross the 
bottleneck formed by the intercommunity edges. Palla et 
al. introduced a number of concepts to implement this 



In graph theory the k-clique by Palla et al. is simply called
clique, or complete graph, with k vertices (Section A.1).




FIG. 23 Clique Percolation Method. The example shows 
communities spanned by adjacent 4-cliques. Overlapping vertices
are shown by the bigger dots. Reprinted figure with permission
from Ref. (Palla et al., 2005). ©2005 by the Nature
Publishing Group. 



idea. Two k-cliques are adjacent if they share k - 1 vertices.
The union of adjacent k-cliques is called k-clique
chain. Two k-cliques are connected if they are part of
a k-clique chain. Finally, a k-clique community is the
largest connected subgraph obtained by the union of a
k-clique and of all k-cliques which are connected to it.
Examples of k-clique communities are shown in Fig. 23.
One could say that a k-clique community is identified by
making a k-clique "roll" over adjacent k-cliques, where
rolling means rotating a k-clique about the k - 1 vertices
it shares with any adjacent k-clique. By construction,
k-clique communities can share vertices, so they can be
overlapping. There may be vertices belonging to non-adjacent
k-cliques, which could be reached by different
paths and end up in different clusters. Unfortunately,
there are also vertices that cannot be reached by any k-clique,
like, e. g., vertices with degree one ("leaves").
In order to find k-clique communities, one searches first
for maximal cliques. Then a clique-clique overlap matrix
O is built (Everett and Borgatti, 1998), which is an
n_c × n_c matrix, n_c being the number of cliques; O_ij is
the number of vertices shared by cliques i and j. To
find k-clique communities, one needs simply to keep the
entries of O which are larger than or equal to k - 1, set the others
to zero and find the connected components of the resulting
matrix. Detecting maximal cliques is known to
require a running time that grows exponentially with the
size of the graph. However, the authors found that, for
the real networks they analyzed, the procedure is quite
fast, due to the fairly limited number of cliques, and that
(sparse) graphs with up to 10^5 vertices can be analyzed
in a reasonably short time. The actual scalability of the
algorithm depends on many factors, and cannot be expressed
in closed form. An interesting aspect of k-clique
communities is that they allow one to make a clear distinction
between random graphs and graphs with community
structure. This is a rather delicate issue: we have seen in
Section VI.C that Newman-Girvan modularity can attain
large values on random graphs. Derenyi et al. (Derenyi
et al., 2005) have studied the percolation properties of
k-cliques on random graphs, when the edge probability
p varies. They found that the threshold p_c(k) for the
emergence of a giant k-clique community, i. e. a community
occupying a macroscopic portion of the graph, is
p_c(k) = [(k - 1)n]^{-1/(k-1)}, n being the number of vertices
of the graph, as usual. For k = 2, for which the k-cliques
reduce to edges, one recovers the known expression for
the emergence of a giant connected component in Erdos-Renyi
graphs (Section A.3). This percolation transition
is quite sharp: if the edge probability p < p_c(k), k-clique
communities are rather small; if p > p_c(k) there is a giant
component and many small communities. To assess
the significance of the clusters found with the CPM, one 
can compare the detected cover with the cover found 
on a null model graph, which is random but preserves 
the expected degree sequence of the original graph. The 
modularity of Newman and Girvan is based on the same 
null model (Section III.C.2). The null models of real 
graphs seem to display the same two scenarios found 
for Erdos-Renyi graphs, characterized by the presence 
of very small k-clique communities, with or without a
giant cluster. Therefore, covers with k-clique communities
of large or appreciable size can hardly be due to
random fluctuations. Palla and coworkers (Adamcsek
et al., 2006) have designed a software package implementing
the CPM, called CFinder, which is freely available
(www.cfinder.org).
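
The procedure based on maximal cliques and on the clique-clique overlap matrix can be sketched as follows. This is an illustration under our own assumptions, not the CFinder code: the graph is a dictionary mapping each vertex to the set of its neighbours, maximal cliques are enumerated with a plain Bron-Kerbosch recursion, maximal cliques with fewer than k vertices are discarded, and the connected components are found with a simple union-find structure. The helper critical_edge_probability evaluates the threshold p_c(k) = [(k - 1)n]^{-1/(k-1)} of Derenyi et al. All function names are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate the maximal cliques of the graph (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(frozenset(R))    # R cannot be extended any further
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques


def k_clique_communities(adj, k):
    """Unions of maximal cliques (size >= k) linked by overlaps of >= k - 1 vertices."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            # entry O_ij of the overlap matrix, thresholded at k - 1
            if len(cliques[i] & cliques[j]) >= k - 1:
                parent[find(i)] = find(j)
    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return list(communities.values())


def critical_edge_probability(n, k):
    """Threshold p_c(k) for a giant k-clique community in a random graph."""
    return ((k - 1) * n) ** (-1.0 / (k - 1))
```

In the test graph used with this sketch, two triangles sharing an edge form one 3-clique community, a third triangle attached through a single vertex forms another, the shared vertex belongs to both, and a leaf is left out of the cover.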

The algorithm has been extended to the analysis of 
weighted, directed and bipartite graphs. For weighted 
graphs, in principle one can follow the standard proce- 
dure of thresholding the weights, and apply the method 
on the resulting graphs, treating them as unweighted. 
Farkas et al. (Farkas et al., 2007) proposed instead to
threshold the weight of cliques, defined as the geometric
mean of the weights of all edges of the clique. The
value of the threshold is chosen slightly above the critical
value at which a giant k-clique community emerges,
in order to get the richest possible variety of clusters. On
directed graphs, Palla et al. defined directed k-cliques as
complete graphs with k vertices, such that there is an
ordering among the vertices, and each edge goes from a 
vertex with higher order to one with lower order. The or- 
dering is determined from the restricted outdegree of the 
vertex, expressing the fraction of outgoing edges point- 



We recall that a cover is the equivalent of a partition for overlapping
communities.



ing to the other vertices of the clique versus the total 
outdegree. The method has been extended to bipartite 
graphs by Lehmann et al. (Lehmann et al., 2008). In this
case one uses bipartite cliques, or bicliques: a subgraph
K_{a,b} is a biclique if each of a vertices of one class is
connected with each of b vertices of the other class. Two
cliques K_{a,b} are adjacent if they share a clique K_{a-1,b-1},
and a K_{a,b} clique community is the union of all K_{a,b}
cliques that can be reached from each other through a
path of adjacent K_{a,b} cliques. Finding all N_c bicliques
of a graph is an NP-complete problem (Peeters, 2003),
mostly because the number of bicliques tends to grow
exponentially with the size of the graph. The algorithm
designed by Lehmann et al. to find biclique communities
is similar to the original CPM, and has a total complexity
of O(N_c^2). On sparse graphs, N_c often grows linearly
with the number of edges m, yielding a time complexity
O(m^2). Bicliques are also the main ingredients of BiTector,
a recent algorithm to detect community structure in
bipartite graphs (Du et al., 2008).
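
For the weighted variant, the clique weight used by Farkas et al. is the geometric mean of the edge weights of the clique. A short sketch follows; we assume, for the sake of the example, that positive edge weights are stored in a dictionary keyed by frozenset({u, v}), and the function name is ours.

```python
import math
from itertools import combinations


def clique_weight(weights, clique):
    """Geometric mean of the weights of all edges of the clique."""
    edges = [frozenset(pair) for pair in combinations(sorted(clique), 2)]
    # averaging logarithms avoids overflow when many weights are multiplied
    log_sum = sum(math.log(weights[e]) for e in edges)
    return math.exp(log_sum / len(edges))
```

A k-clique is then retained only if its weight exceeds the chosen threshold; a triangle with edge weights 1, 2 and 4, for instance, has weight 2.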

Kumpula et al. have developed a fast implementa- 
tion of the CPM, called Sequential Clique Percolation 
algorithm (SCP) (Kumpula et al., 2008). It consists in
detecting k-clique communities by sequentially inserting
the edges of the graph under study, one by one, starting
from an initial empty graph. Whenever a new edge is
added, one checks whether new k-cliques are formed, by
searching for (k - 2)-cliques in the subset of neighboring
vertices of the endpoints of the inserted edge. The procedure
requires building a graph Γ*, in which the vertices
are (k - 1)-cliques and edges are set between vertices corresponding
to (k - 1)-cliques which are subgraphs of the
same k-clique. At the end of the process, the connected
components of Γ* correspond to the searched k-clique
communities. The technique has a time complexity which
is linear in the number of k-cliques of the graph, so it can
vary a lot in practical applications. Nevertheless, it turns
out to be much faster than the original implementation
of the CPM. The big advantage of the SCP, however,
consists of its implementation for weighted graphs. By
inserting edges in decreasing order of weight, one recovers
in a single run the community structure of the graph
for all possible weight thresholds, by storing every cover
detected after the addition of each edge. The standard
CPM, instead, needs to be applied once for each threshold.
If, instead of edge weight thresholding, one performs
k-clique weight thresholding, as prescribed by Farkas et
al. (Farkas et al., 2007), the SCP remains much faster
than the CPM, if one applies a simple modification to
it, consisting in detecting and storing all k-cliques on the
full graph, sorting them based on their weights, and finding
the communities by sequentially adding the k-cliques
in decreasing order of weight.
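
The sequential insertion of edges can be sketched as follows. The code is a simplified illustration and not the implementation of Kumpula et al.: new k-cliques are searched among the common neighbours of the endpoints of the inserted edge, and the connected components of the graph of (k - 1)-cliques are maintained with a union-find structure, which is our choice for the example. The function name and the edge-list input format are assumptions.

```python
from itertools import combinations


def sequential_clique_percolation(edges, k):
    """Insert edges one by one and return the k-clique communities found."""
    adj = {}
    parent = {}                             # union-find over (k - 1)-cliques

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        common = adj[u] & adj[v]            # vertices adjacent to both endpoints
        adj[u].add(v)
        adj[v].add(u)
        for rest in combinations(sorted(common), k - 2):
            # rest must itself be a (k - 2)-clique to close a new k-clique
            if all(b in adj[a] for a, b in combinations(rest, 2)):
                clique = (u, v) + rest
                subs = [frozenset(s) for s in combinations(clique, k - 1)]
                root = find(subs[0])
                for s in subs[1:]:
                    parent[find(s)] = root  # (k - 1)-cliques of one k-clique are linked
    communities = {}
    for s in list(parent):
        communities.setdefault(find(s), set()).update(s)
    return list(communities.values())
```

If the edges are supplied in decreasing order of weight, the state of the union-find structure after each insertion corresponds to the cover at the associated weight threshold; storing these intermediate covers is not shown here.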

The CPM has the same limit as the algorithm of Radicchi
et al. (Radicchi et al., 2004) (Section V.B): it assumes
that the graph has a large number of cliques, so it may 
fail to give meaningful covers for graphs with just a few 
cliques, like technological networks and some social networks.
On the other hand, if there are many cliques, the
method may deliver trivial community structure, like a
cover consisting of the whole graph as a single cluster. 
A more fundamental issue is the fact that the method 
does not look for actual communities, consistent with 
the shared notion of dense subgraphs, but for subgraphs 
"containing" many cliques, which may be quite differ- 
ent objects than communities (for instance, they could 
be "chains" of cliques with low internal edge density). 
Another big problem is that on real networks there is a 
considerable fraction of vertices that are left out of the 
communities, like leaves. One could think of some post- 
processing procedure to include them in the communities, 
but for that it is necessary to introduce a new criterion, 
outside the framework that inspired the method. Fur- 
thermore it is not clear a priori which value of k one has 
to choose to identify meaningful structures. Finally, the 
criterion to choose the threshold for weighted graphs and 
the definition of directed k-cliques are rather arbitrary.

B. Other techniques 

One of the first methods to find overlapping communities
was designed by Baumes et al. (Baumes et al.,
2005b). A community is defined as a subgraph which
locally optimizes a given function W, typically some measure
related to the edge density of the cluster. Different
overlapping subsets may all be locally optimal, so vertices 
can be shared between communities. Detecting the clus- 
ter structure of a graph amounts to finding the set of 
all locally optimal clusters. Two efficient heuristics are 
proposed, called Iterative Scan (IS) and Rank Removal 
(RaRe). IS performs a greedy optimization of the func- 
tion W. One starts from a random seed vertex/edge and 
adds/deletes vertices one by one as long as W increases. 
Then another seed is randomly picked and the procedure 
is repeated. The algorithm stops when, by picking any 
seed, one recovers a previously identified cluster. RaRe 
consists in removing important vertices so as to disconnect
the graph into small components representing the
cores of the clusters. The importance of vertices is determined
by their centrality scores (e.g. degree, betweenness
centrality (Freeman, 1977), PageRank (Brin and Page,
1998)). Vertices are removed until one fragments the
graph into components of a given size. After that, the 
removed vertices are added again to the graph, and are 
associated to those clusters for which doing so increases 
the value of the function W. The complexity of IS and 
RaRe is O(n^2) on sparse graphs. The best performance is
achieved by using IS to refine results obtained from RaRe.
In a subsequent paper (Baumes et al., 2005a), Baumes et
al. further improved this two-step procedure, in that the



Community definitions based on local optimization are adopted
in other algorithms as well, like that by Lancichinetti et al.
(Lancichinetti et al., 2009) (Section XII.A).



removed vertices in RaRe are reinserted in decreasing or- 
der of their centrality scores, and the optimization of W 
in IS is only extended to neighboring vertices of the run- 
ning cluster. The new recipe maintains time complexity 
O(n^2), but on sparse graphs it requires a time lower by
an order of magnitude than the old one, while the quality 
of the detected clustering is comparable. 
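
The greedy scan performed by IS can be illustrated with the sketch below. The function W is not specified in the text, so we assume a simple density-like score, the fraction of the edges incident to the cluster that are internal; this choice, the function names and the deterministic scan in vertex order (instead of random seeds) are assumptions of the example, not the recipe of Baumes et al.

```python
def density_score(adj, C):
    """Assumed W: internal edges divided by all edges with an endpoint in C."""
    internal = sum(1 for v in C for u in adj[v] if u in C) / 2
    external = sum(1 for v in C for u in adj[v] if u not in C)
    total = internal + external
    return internal / total if total else 0.0


def iterative_scan(adj, seed, score=density_score):
    """Add or delete one vertex at a time as long as the score increases."""
    C = set(seed)
    improved = True
    while improved:
        improved = False
        for v in sorted(adj):
            # toggle the membership of v and keep the change only if W grows
            candidate = C - {v} if v in C else C | {v}
            if candidate and score(adj, candidate) > score(adj, C):
                C = candidate
                improved = True
    return C
```

Since the score strictly increases at every accepted move, the scan terminates in a locally optimal cluster; starting from different seeds may return different, possibly overlapping, clusters.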

A different method, combining spectral mapping, fuzzy 
clustering and the optimization of a quality function, has 
been presented by Zhang et al. (Zhang et al., 2007). The
membership of vertex i in cluster k is expressed by u_ik,
which is a number between 0 and 1. The sum of the
u_ik over all communities k of a cover is 1, for every vertex.
This normalization is suggested by the fact that
the entry u_ik can be thought of as the probability that
i belongs to community k, so the sum of the u_ik represents
the probability that the vertex belongs to any
community of the cover, which is necessarily 1. If there
were no overlaps, u_ik = δ_{k_i k}, where k_i represents the
unique community of vertex i. The algorithm consists of
three phases: 1) embedding vertices in Euclidean space;
2) grouping the corresponding vertex points in a given
number n_c of clusters; 3) maximizing a modularity function
over the set of covers found in step 2), corresponding
to different values of n_c. This scheme has been used in
other techniques as well, like in the algorithm of Donetti
and Munoz (Donetti and Munoz, 2004) (Section VII).
The first step builds upon a spectral technique intro- 
duced by White and Smyth (White and Smyth, 2005), 
that we have discussed in Section VI.A.4. Graph vertices
are embedded in a d-dimensional Euclidean space by using
the top d eigenvectors of the right stochastic matrix
W (Section A.2), derived from the adjacency matrix A
by dividing each element by the sum of the elements of
the same row. The spatial coordinates of vertex i are
the i-th components of the eigenvectors. In the second
step, the vertex points are associated to n_c clusters by using
fuzzy k-means clustering (Bezdek, 1981; Dunn, 1974)
(Section IV.C). The number of clusters n_c varies from
2 to a maximum K, so one obtains K - 1 covers. The
best cover is the one that yields the largest value of the 
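modularity Q_{ov}^{zh}, defined as

Q_{ov}^{zh} = \sum_{c=1}^{n_c} \left[ \frac{W_c}{W} - \left( \frac{S_c}{2W} \right)^2 \right],    (71)

where

W_c = \sum_{i,j \in V_c} \frac{u_{ic} + u_{jc}}{2} \, w_{ij},    (72)

and

S_c = W_c + \sum_{i \in V_c, \, j \in V \setminus V_c} \frac{u_{ic} + (1 - u_{jc})}{2} \, w_{ij}.    (73)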

The sets Vc and V include the vertices of module c and of 
the whole network, respectively. Eq. 71 is an extension of 
the weighted modularity in Eq. 36, obtained by weighing 
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the contribution of the edges' weights to the sums in Wc 
and Sc by the (average) membership coefficients of the 
vertices of the edge. The determination of the eigenvec- 
tors is the most computationally expensive part of the 
method, so the time complexity is the same as that of 
the algorithm by White and Smyth (see Section VI. A. 4), 
i. e. 0{K^n-\- Km), which is essentially linear in n if the 
graph is sparse and K <ti n. 
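
The second phase of the scheme can be illustrated with the membership update of standard fuzzy k-means (also known as fuzzy c-means). The sketch below is not the implementation of Zhang et al.: it assumes that the vertices have already been embedded as points, that the cluster centers are given, and that the fuzzifier equals 2. The names `fuzzy_memberships`, `points`, `centers` and `fuzzifier` are illustrative. It shows that the resulting coefficients $u_{ik}$ sum to 1 for every vertex, as required of a membership matrix.

```python
import math


def fuzzy_memberships(points, centers, fuzzifier=2.0):
    """Return the membership matrix U for fixed cluster centers.

    Standard fuzzy c-means update (an assumption, not taken from the text):
    u_ik = 1 / sum_j (d_ik / d_ij) ** (2 / (fuzzifier - 1)),
    where d_ik is the Euclidean distance between point i and center k.
    """
    exponent = 2.0 / (fuzzifier - 1.0)
    memberships = []
    for point in points:
        dists = [math.dist(point, center) for center in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share the
            # membership equally among those centers only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        row = [1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
               for d_k in dists]
        memberships.append(row)
    return memberships


# Three embedded vertices on a line and two cluster centers.
U = fuzzy_memberships([(0.25, 0.0), (0.5, 0.0), (1.0, 0.0)],
                      [(0.0, 0.0), (1.0, 0.0)])
for row in U:
    print([round(u, 3) for u in row], "row sum:", round(sum(row), 3))
```

A point lying midway between two centers receives equal coefficients, which is the situation in which a vertex is shared between two communities.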

Nepusz et al. proposed a different approach based
on vertex similarity (Nepusz et al., 2008). One starts
from the membership matrix U, defined as in the previous
method by Zhang et al. From U a matrix S is
built, where $s_{ij} = \sum_{k=1}^{n_c} u_{ik} u_{jk}$ expresses the similarity
between vertices i and j ($n_c$ is the number of clusters). If
one assumes to have information about the actual vertex
similarity, corresponding to the matrix $\tilde{S}$, the best cover
is obtained by choosing U such that S approximates $\tilde{S}$ as
closely as possible. This amounts to minimizing the function

$$D_G(U) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left( \tilde{s}_{ij} - s_{ij} \right)^2, \qquad (74)$$

where the $w_{ij}$ weigh the importance of the approximation
for each entry of the similarity matrices. In the absence
of any information on the community structure of the
graph, one sets $w_{ij} = 1$, $\forall i, j$ (equal weights) and $\tilde{S}$ equal
to the adjacency matrix A, by implicitly assuming that
vertices are similar if they are neighbors, dissimilar otherwise.
On weighted graphs, one can set the $w_{ij}$ equal to
the edge weights. Minimizing $D_G(U)$ is a nonlinear constrained
optimization problem, that can be solved with
an iterative optimization method, either gradient-based
or stochastic (e.g. simulated annealing). The optimization procedure adopted
by Nepusz et al., for a fixed number of clusters $n_c$, has
a time complexity $O(n^2 n_c h)$, where h is the number of
iterations leading to convergence, so the method can only
be applied to fairly small graphs. If $n_c$ is unknown, as it
usually happens, the best cover is the one corresponding
to the largest value of the modularity

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s_{ij}. \qquad (75)$$

Eq. 75 is very similar to the expression of Newman-Girvan
modularity (Eq. 13): the difference is that the
Kronecker $\delta$ is replaced by the vertices' similarity, to account
for overlapping communities. Once the best cover
is identified, one can use the entries of the partition matrix
U to evaluate the participation of each vertex in the
$n_c$ clusters of the cover. Nepusz et al. defined the bridgeness
$b_i$ of a vertex i as

$$b_i = 1 - \sqrt{\frac{n_c}{n_c - 1} \sum_{k=1}^{n_c} \left( u_{ik} - \frac{1}{n_c} \right)^2}. \qquad (76)$$

If i belongs to a single cluster, $b_i = 0$. If, for a vertex
i, $u_{ik} = 1/n_c$, $\forall k$, $b_i = 1$ and i is a perfect bridge, as
it lies exactly between all clusters. However, a vertex
with high $b_i$ may be simply an outlier, not belonging
to any cluster, since its membership is also spread
evenly over the clusters. Since real bridges are usually rather
central vertices, one can identify them by checking
for large values of the centrality-corrected bridgeness,
obtained by multiplying the bridgeness of Eq. 76 by
the centrality of the vertex (expressed by, e.g., degree,
betweenness (Freeman, 1977), etc.). A variant of the
algorithm by Nepusz et al. can be downloaded from
http://www.cs.rhul.ac.uk/home/tamas/assets/files/fuzzyclust-static.tar.gz.
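
The quantities used in this method are simple functions of the membership matrix, and a short sketch may help to fix ideas. It assumes the similarity $s_{ij} = \sum_k u_{ik} u_{jk}$, the cost of Eq. 74 and a bridgeness of the form $b_i = 1 - \sqrt{\frac{n_c}{n_c - 1} \sum_k (u_{ik} - 1/n_c)^2}$, which reproduces the two limiting cases quoted in the text. The function names are illustrative and the code is not the software distributed by Nepusz et al.

```python
import math


def similarity_matrix(U):
    """s_ij = sum_k u_ik * u_jk for a membership matrix U (list of rows)."""
    n = len(U)
    return [[sum(a * b for a, b in zip(U[i], U[j])) for j in range(n)]
            for i in range(n)]


def fit_cost(U, target, weights=None):
    """Weighted squared mismatch between the target similarity and S(U)."""
    S = similarity_matrix(U)
    n = len(U)
    total = 0.0
    for i in range(n):
        for j in range(n):
            w = 1.0 if weights is None else weights[i][j]
            total += w * (target[i][j] - S[i][j]) ** 2
    return total


def bridgeness(memberships):
    """Bridgeness of one vertex from its row of membership coefficients.

    Assumed form: 0 for a vertex in a single cluster, 1 for a vertex
    shared equally among all clusters. Requires at least two clusters.
    """
    n_c = len(memberships)
    spread = sum((u - 1.0 / n_c) ** 2 for u in memberships)
    return 1.0 - math.sqrt(n_c / (n_c - 1) * spread)


# Five vertices, two clusters; the third vertex is shared equally.
U = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
print([round(bridgeness(row), 3) for row in U])
```

Multiplying `bridgeness(row)` by a centrality score of the vertex, e.g. its degree, would give the centrality-corrected variant mentioned above.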

In real networks it is often easier to discriminate be- 
tween intercluster and intracluster edges than recogniz- 
ing overlapping vertices. For instance, in social networks, 
even though many people may belong to more groups, 
their social ties within each group can be easily spotted. 
Besides, it may happen that communities are joined to 
each other through their overlapping vertices (Fig. 24), 
without intercluster edges. For these reasons, it has been 
recently suggested that defining clusters as sets of edges, 
rather than vertices, may be a promising strategy to an- 
alyze graphs with overlapping communities (Ahn et al., 
2009; Evans and Lambiotte, 2009). One has to focus 
on the line graph (Balakrishnan, 1997), i. e. the graph 
whose vertices are the edges of the original graph; ver- 
tices of the line graph are linked if the corresponding 
edges in the original graph are adjacent, i. e. if they 
share one of their endvertices. Partitioning the line graph 
means grouping the edges of the starting graph. Evans
and Lambiotte (Evans and Lambiotte, 2009) introduced 
a set of quality functions, similar to Newman-Girvan 
modularity (Eq. 13), expressing the stability of parti- 
tions against random walks taking place on the graph, 
following the work of Delvenne et al. (Delvenne et al., 
2008) [Section VIII. B]. They considered a projection of 
the traditional random walk on the line graph, along with 
two other diffusion processes, where walkers move be- 
tween adjacent edges (rather than between neighboring 
vertices). Evans and Lambiotte optimized the three cor- 
responding modularity functions to look for partitions in 
two real networks, Zachary's karate club (Zachary, 1977)
(Section XV. A) and the network of word associations de- 
rived from the University of South Florida Free Associa- 
tion Norms (Nelson et al., 1998) (Section II). The opti- 
mization was carried out with the hierarchical technique 
by Blondel et al. (Blondel et al., 2008) and the multi-level 
algorithm by Noack and Rotta (Noack and Rotta, 2009). 
While the results for the word association network are
reasonable, the test on the karate club yields partitions
in more than two clusters. However, the modularities
used by Evans and Lambiotte can be modified to include
longer random walks (just like in Ref. (Delvenne et al.,
2008)), and the length of the walk represents a resolution
parameter that can be tuned to get better results.
Ahn et al. (Ahn et al., 2009) proposed to group edges
with an agglomerative hierarchical clustering technique,
called hierarchical link clustering (Section IV.B). They
use a similarity measure for a pair of (adjacent) edges
that expresses the size of the overlap between the neighborhoods
of the non-coincident endvertices, divided by
the total number of (different) neighbors of such endvertices.
Groups of edges are merged pairwise in descending
order of similarity, until all edges are together in
the same cluster. The resulting dendrogram provides the
most complete information on the community structure
of the graph. However, as usual, most of this information
is redundant and is an artefact of the procedure itself.
So, Ahn et al. introduced a quality function to select
the most meaningful partition(s), called partition density,
which is essentially the average edge density within the
clusters. The method is able to find meaningful clusters
in biological networks, like protein-protein and metabolic
networks, as well as in a social network of mobile phone
communications. It can also be extended to multipartite
and weighted graphs.

Footnote (on partitioning the line graph): Ideally one wants
to put together only the edges lying within
clusters, and exclude the others. Therefore partitioning does not
necessarily mean assigning each vertex of the line graph to a
group, as standard clustering techniques would do.

FIG. 24 Communities as sets of edges. In the figure, the
graph has a natural division in two triangles, with the central
vertex shared between them. If communities are identified by
their internal edges, detecting the triangles and their overlapping
vertex becomes easier than by using methods that group
vertices. Reprinted figure with permission from Ref. (Evans
and Lambiotte, 2009). ©2009 by the American Physical Society.
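
Both ingredients of the edge-based approach, the line graph and the similarity between adjacent edges, can be sketched on the graph of Fig. 24 (two triangles sharing one vertex). The code below is an illustration, not the software of the cited authors. In particular it assumes that the neighborhood of a vertex includes the vertex itself when the overlap is computed, a detail the description above leaves open. All function names are hypothetical.

```python
from itertools import combinations


def neighbor_sets(edges):
    """Map each vertex to the set of its neighbors."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    return neighbors


def line_graph(edges):
    """Vertices are the original edges; two are linked if they share an endvertex."""
    edges = [tuple(sorted(e)) for e in edges]
    adjacency = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adjacency[e].add(f)
            adjacency[f].add(e)
    return adjacency


def edge_similarity(e, f, neighbors):
    """Overlap of the neighborhoods of the two non-shared endvertices."""
    shared = set(e) & set(f)
    if len(shared) != 1:
        raise ValueError("edges must share exactly one endvertex")
    (i,) = set(e) - shared
    (j,) = set(f) - shared
    n_i = neighbors[i] | {i}  # assumption: inclusive neighborhoods
    n_j = neighbors[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)


# Two triangles {0, 1, 2} and {2, 3, 4} sharing vertex 2.
bowtie = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
nb = neighbor_sets(bowtie)
print(edge_similarity((0, 2), (1, 2), nb))  # two edges of the same triangle
print(edge_similarity((1, 2), (2, 3), nb))  # edges of different triangles
```

Pairs of edges inside the same triangle score higher than pairs straddling the shared vertex, so an agglomerative procedure merging pairs in descending order of similarity would assemble the two triangles first.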

The idea of grouping edges is surely interesting. How- 
ever it is not a priori better than grouping vertices. 
In fact, the two situations are somewhat symmetric. 
Edges connecting vertices of different clusters are "overlapping",
but they will be assigned just to one cluster (or
else the clusters would be merged).

The possibility of having overlapping communities
makes most standard clustering methods inadequate, and
calls for the design of new ad hoc techniques, like the
ones we have described so far. On the other hand, if
it were possible to identify the overlapping vertices and
"separate" them among the clusters they belong to, the
overlaps would be removed and one could then apply
any of the traditional clustering methods to the resulting
graph. This idea is at the basis of a recent method
proposed by Gregory (Gregory, 2009). It is a three-stage
procedure: first, one transforms the graph into
a larger graph without overlapping vertices; second, a 
clustering technique is applied to the resulting graph; 
third, one maps the partition obtained into a cover by 
replacing the vertices with those of the original graph. 
The transformation step, called Peacock, is performed
by identifying the vertices with highest split betweenness
(Section V.A) and splitting them in multiple parts, connected
by edges. This is done as long as the split betweenness
of the vertices is sufficiently high, which is
determined by a parameter s. In this way, most vertices
of the resulting graph are exactly the same as in the
initial graph, while the others are multiple copies of the overlapping
vertices of the initial graph. The overlaps of the
final cover are obtained by checking if copies of the same
initial vertex end up in different disjoint clusters. The
complexity is dominated by the Peacock algorithm, if
one computes the exact values of the split betweenness
for the vertices, which requires a time $O(n^3)$ on a sparse
graph. Gregory proposed an approximate local computation,
which scales as $O(n \log n)$: in this way the total
complexity of the method becomes competitive, if one
chooses a fast algorithm for the identification of the clusters.
The goodness of the results depends on the specific
method one uses to find the clusters after the graph transformation.
The software of the version of the method
used by Gregory in his applications can be found at
http://www.cs.bris.ac.uk/~steve/networks/peacockpaper/.
The idea of Gregory is interesting, as it
allows one to exploit traditional methods even in the presence
of overlapping communities. The choice of the parameter 
s, which determines whether a vertex is overlapping or 
not, does not seem to affect significantly the results, as 
long as s is taken sufficiently small. 
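
The third stage of the procedure, mapping the partition of the transformed graph back to a cover of the original one, is easy to make concrete. The sketch below assumes that the transformation step has already produced a dictionary `original_of` sending each copy of a split vertex to the vertex it came from; that dictionary, the copy labels `"2a"` and `"2b"` and the function names are all hypothetical, and the splitting itself (Peacock) is not implemented here.

```python
def partition_to_cover(clusters, original_of):
    """Replace every copy of a split vertex by its original vertex."""
    return [{original_of.get(v, v) for v in cluster} for cluster in clusters]


def overlapping_vertices(cover):
    """Vertices that appear in more than one cluster of the cover."""
    seen, shared = set(), set()
    for cluster in cover:
        shared |= cluster & seen
        seen |= cluster
    return shared


# Vertex 2 of the original graph was split into the copies "2a" and "2b";
# a standard method then put the copies in two different disjoint clusters.
clusters = [{0, 1, "2a"}, {"2b", 3, 4}]
original_of = {"2a": 2, "2b": 2}
cover = partition_to_cover(clusters, original_of)
print(cover, overlapping_vertices(cover))
```

If both copies had ended up in the same cluster, the vertex would simply reappear once and the cover would contain no overlap at that vertex.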



Footnote (on the exact computation of the split betweenness): The
split betweenness needs to be recalculated after each vertex
split, just as one does for the edge betweenness in the Girvan-Newman
algorithm (Girvan and Newman, 2002). Therefore both
computations have the same complexity.



XII. MULTIRESOLUTION METHODS AND CLUSTER HIERARCHY

The existence of a resolution limit for Newman-Girvan
modularity (Section VI.C) implies that the straight optimization
of quality functions yields a coarse description
of the cluster structure of the graph, at a scale which
has a priori nothing to do with the actual scale of the
clusters. In the absence of information on the cluster
sizes of the graph, a method should be able to explore
all possible scales, to make sure that it will eventually
identify the right communities. Multiresolution methods
are based on this principle. However, many real graphs
display hierarchical cluster structures, with clusters in- 
side other clusters (Simon, 1962). In these cases, there 
are more levels of organization of vertices in clusters, and 
more relevant scales. In principle, clustering algorithms 
should be able to identify them. Multiresolution meth- 
ods can do the trick, in principle, as they scan continu- 
ously the range of possible cluster scales. Recently other 
methods have been developed, where partitions are by 
construction hierarchically nested in each other. In this 
section we discuss both classes of techniques. 



A. Multiresolution methods

In general, multiresolution methods have a freely tunable
parameter, that allows one to set the characteristic size
of the clusters to be detected. The general spin glass
framework by Reichardt and Bornholdt ((Reichardt and
Bornholdt, 2006a) and Section VI.B) is a typical example,
where $\gamma$ is the resolution parameter. The extension
of the method to weighted graphs has been recently discussed
(Heimo et al., 2008).

Pons has proposed a method (Pons, 2006) consisting of
the optimization of multiscale quality functions, including
the multiscale modularity

$$Q_\alpha^M = \sum_{c=1}^{n_c} \left[ \alpha \frac{l_c}{m} - (1 - \alpha) \left( \frac{d_c}{2m} \right)^2 \right], \qquad (77)$$

and two other additive quality functions, derived from
the performance (Eq. 12) and a measure based on the
similarity of vertex pairs. In Eq. 77, $0 \le \alpha \le 1$ is the
resolution parameter and the notation is otherwise the
same as in Eq. 14. We see that, for $\alpha = 1/2$, one recovers
standard modularity (up to a factor 1/2). However, since multiplicative
factors in $Q_\alpha^M$ do not change the results of the optimization,
we can divide $Q_\alpha^M$ by $\alpha$, recovering the same quality
function as in Eq. 46, with $\gamma = (1 - \alpha)/\alpha$, up to an
irrelevant multiplicative constant. To evaluate the relevance
of the partitions, for any given multiscale quality
function, Pons suggested that the length of the $\alpha$-range
$[\alpha_{min}(C), \alpha_{max}(C)]$, for which a community C "lives" in
the maximum modularity partition, is a good indicator
of the stability of the community. He then defined the
relevance function of a community C at scale $\alpha$ (Eq. 78).
The relevance $R(\alpha)$ of a partition $\mathcal{P}$ at scale $\alpha$ is the
average of the relevances of the clusters of the partition,
weighted by the cluster sizes. Peaks in $\alpha$ of $R(\alpha)$ reveal
the most meaningful partitions.

FIG. 25 Analysis of Zachary's karate club with the multiresolution
method by Arenas et al. (Arenas et al., 2008b). The
plot shows the number of clusters obtained in correspondence
of the resolution parameter r. The longest plateau (I) indicates
the most stable partition, which exactly matches the social
fission observed by Zachary. The partition obtained with
straight modularity optimization (r = 0) consists of four clusters
and is much less stable with respect to (I), as suggested
by the much shorter length of its plateau. Reprinted figure
with permission from Ref. (Arenas et al., 2008b). ©2008 by
IOP Publishing.
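
The role of the resolution parameter in the multiscale modularity of Pons can be checked numerically. The sketch below assumes the form $Q_\alpha^M = \sum_c [\alpha\, l_c/m - (1 - \alpha)(d_c/2m)^2]$, with $l_c$ the number of internal edges and $d_c$ the total degree of cluster c. The function name and the toy graph (two triangles joined by one edge) are illustrative, and no optimization over partitions is attempted.

```python
def multiscale_modularity(edges, partition, alpha):
    """Evaluate the assumed multiscale modularity for a given partition.

    edges: list of undirected edges (i, j), each listed once.
    partition: list of disjoint vertex sets covering all vertices.
    """
    m = len(edges)
    community = {v: c for c, members in enumerate(partition) for v in members}
    internal = [0] * len(partition)    # l_c
    degree_sum = [0] * len(partition)  # d_c
    for i, j in edges:
        degree_sum[community[i]] += 1
        degree_sum[community[j]] += 1
        if community[i] == community[j]:
            internal[community[i]] += 1
    return sum(alpha * l / m - (1 - alpha) * (d / (2 * m)) ** 2
               for l, d in zip(internal, degree_sum))


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
triangles = [{0, 1, 2}, {3, 4, 5}]
whole = [{0, 1, 2, 3, 4, 5}]
for alpha in (0.2, 0.5, 0.8):
    print(alpha,
          round(multiscale_modularity(edges, triangles, alpha), 4),
          round(multiscale_modularity(edges, whole, alpha), 4))
```

For large $\alpha$ the internal-edge term dominates and the single cluster scores best, while small $\alpha$ favors finer partitions, which is the tuning behavior described above.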

Another interesting technique has been devised by Arenas
et al. (Arenas et al., 2008b), and consists of a modification
of the original expression of modularity. The
idea is to make vertices contribute as well to the computation
of the edge density of the clusters, by adding
a self-loop of strength r to each vertex. Arenas et al.
remarked that the parameter r does not affect the structural
properties of the graph in most cases, which are
usually determined by an adjacency matrix without diagonal
elements. With the introduction of the vertex
strength r, modularity reads

$$Q_r = \sum_{c} \left[ \frac{2W_c + N_c r}{2W + nr} - \left( \frac{S_c + N_c r}{2W + nr} \right)^2 \right], \qquad (79)$$

for the general case of a weighted graph. The notation
is the same as in Eq. 36, $N_c$ is the number of vertices
in cluster c. We see that now the relative importance of
the two terms in each summand depends on r, which can
take any value in $]-2W/n, \infty[$. Arenas et al. made a
sweep in the range of r, and determined for each r the
maximum modularity with extremal optimization (Section
VI.A.3) and tabu search (Glover, 1986). Meaningful
cluster structures correspond to plateaus in the
plot of the number of clusters versus r (Fig. 25). The
length of a plateau gives a measure of the stability of the
partition against the variation of r. The procedure is
able to disclose the community structure of a number of
real benchmark graphs. As expected, the most relevant
partitions can be found in intervals of r not including
the value r = 0, which corresponds to the case of standard
modularity (Fig. 25). A drawback of the method is
that it is very slow, as one has to compute the modularity
maximum for many values of r in order to discriminate
between relevant and irrelevant partitions. If the
modularity maximum is computed with precise methods
like simulated annealing and/or extremal optimization,
as in Ref. (Arenas et al., 2008b), only graphs with a few
hundred vertices can be analyzed on a single processor.
On the other hand the algorithm can be trivially parallelized
by running the optimization for different values
of r on different processors. This is a common feature
of all multiresolution methods discussed in this Section.
In spite of the different formal expressions of modularity,
the methods by Arenas et al. and Reichardt and Bornholdt
are somewhat related to each other and yield similar
results (Kumpula et al., 2007a) on Zachary's karate
club (Zachary, 1977) (Section XV.A), synthetic graphs à
la Ravasz-Barabási (Ravasz and Barabási, 2003) and on
a model graph with the properties of real weighted social
networks. In fact, their modularities can be both recovered
from the continuous-time version of the stability
of clustering under random walk, introduced by Delvenne
et al. (Delvenne et al., 2008) (Section VIII.B).

Footnote (on tabu search): Tabu search consists in moving
single vertices from one community to another, chosen at
random, or to new communities, starting from some initial
partition. After a sweep over all vertices, the best move,
i.e. the one producing the largest increase of modularity,
is accepted and applied, yielding a new partition.
The procedure is repeated until modularity does not increase
further. To escape local optima, a list of recent accepted moves
is kept and updated, so that those moves are not accepted in
the next update of the configuration (tabu list). The cost of
the procedure is about the same as that of other stochastic
optimization techniques like, e.g., simulated annealing.

Footnote (on the relation between the two methods): Related
does not mean equivalent, though. Arenas et al. have
shown that their method is better than that by Reichardt and
Bornholdt when the graph at hand includes communities of
different sizes (Arenas et al., 2008b).

Lancichinetti et al. have designed a multiresolution
method which is capable of detecting both the hierarchical
structure of graphs and overlapping communities
(Lancichinetti et al., 2009). It is based on the
optimization of a fitness function, which estimates the
strength of a cluster and entails a resolution parameter
$\alpha$. The function could in principle be arbitrary; in their
applications the authors chose a simple ansatz based on
the tradeoff between the internal and the total degree of
the cluster. The optimization procedure starts from a
cluster with a single vertex, arbitrarily selected. Given a
cluster core, one keeps adding and removing neighboring
vertices of the cluster as long as its fitness increases. The
fitness is recalculated after each addition/removal of a
vertex. At some point one reaches a local maximum and
the cluster is "closed". Then, another vertex is chosen
at random, among those not yet assigned to a cluster,
a new cluster is built, and so on, until all vertices have
been assigned to clusters. During the buildup of a cluster,
vertices already assigned to other clusters may be
included, i.e. communities may overlap. The computational
complexity of the algorithm, estimated on sparse
Erdős-Rényi random graphs, is $O(n^\beta)$, with $\beta \sim 2$ for
small values of the resolution parameter $\alpha$, and $\beta \sim 1$ if
$\alpha$ is large. For a complete analysis, the worst-case computational
complexity is $O(n^2 \log n)$, where the factor $\log n$
comes from the minimum number of different $\alpha$-values
which are needed to resolve the actual community structure
of the graph. Relevant partitions are revealed by
pronounced spikes in the histogram of the fitness values
of covers obtained for different $\alpha$-values, where the fitness
of a cover is defined as the average fitness of its clusters.
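
The local optimization just described can be sketched in a few lines. The text only says that the fitness trades off the internal and the total degree of a cluster; the specific form used below, $f = k_{in} / (k_{in} + k_{out})^\alpha$, is an assumption made for this illustration, as are the function names and the simple greedy rule (add the best neighbor if the fitness grows, otherwise try to remove one vertex). It is not the code of Lancichinetti et al.

```python
def fitness(cluster, adj, alpha):
    """Assumed fitness: k_in / (k_in + k_out) ** alpha.

    k_in counts every internal edge twice (once per endvertex),
    k_out counts the edges leaving the cluster.
    """
    k_in = sum(1 for v in cluster for w in adj[v] if w in cluster)
    k_out = sum(1 for v in cluster for w in adj[v] if w not in cluster)
    if k_in + k_out == 0:
        return 0.0
    return k_in / (k_in + k_out) ** alpha


def grow_cluster(seed, adj, alpha=1.0):
    """Grow the cluster of `seed` until its fitness reaches a local maximum."""
    cluster = {seed}
    improved = True
    while improved:
        improved = False
        current = fitness(cluster, adj, alpha)
        frontier = {w for v in cluster for w in adj[v]} - cluster
        best_gain, best_vertex = 0.0, None
        for w in frontier:
            gain = fitness(cluster | {w}, adj, alpha) - current
            if gain > best_gain:
                best_gain, best_vertex = gain, w
        if best_vertex is not None:
            cluster.add(best_vertex)
            improved = True
            continue
        for v in list(cluster):
            if len(cluster) > 1 and fitness(cluster - {v}, adj, alpha) > current:
                cluster.remove(v)
                improved = True
                break
    return cluster


# Two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(grow_cluster(0, adj, alpha=1.0))
print(grow_cluster(0, adj, alpha=0.5))
```

Every accepted move strictly increases the fitness, so the loop terminates. Repeating the growth from seeds not yet assigned, and for several values of $\alpha$, would give the covers whose fitness histogram is discussed above.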

A technique based on the Potts model, similar to that
of Reichardt and Bornholdt (Reichardt and Bornholdt,
2006a), has been suggested by Ronhovde and Nussinov
(Ronhovde and Nussinov, 2008). The energy of their
spin model is

$$\mathcal{H}(\{\sigma\}) = -\frac{1}{2} \sum_{i \neq j} \left( A_{ij} - \gamma J_{ij} \right) \delta(\sigma_i, \sigma_j), \qquad (80)$$

where $J_{ij} = 1 - A_{ij}$ flags missing edges.
The big difference with Eq. 46 is the absence of a null
model term. The model considers pairs of vertices in the
same community: edges between vertices are energetically
rewarded, whereas missing edges are penalized. The
parameter $\gamma$ fixes the tradeoff between the two contributions.
The energy is minimized by sequentially shifting
single vertices/spins to the communities which yield the
largest decrease of the system's energy, until convergence.
If, for each vertex, one just examines the communities of
its neighbors, the energy is minimized in a time $O(m^\beta)$,
where $\beta$ turns out to be slightly above 1 in most applications,
allowing for the analysis of large graphs. This
essentially eliminates the problem of limited resolution,
as the criterion to decide about the merger or the split
of clusters only depends on local parameters. Still, for
the detection of possible hierarchical levels tuning $\gamma$ is
mandatory. In a subsequent paper (Ronhovde and Nussinov,
2009), the authors have introduced a new stability
criterion for the partitions, consisting of the computation
of the similarity of partitions obtained for the same
$\gamma$ and different initial conditions. The idea is that, if
a partition is robust in a given range of $\gamma$-values, most
replicas delivered by the algorithm will be very similar.
On the other hand, if one explores a region of resolutions
in between two strong partitions, the algorithm will deliver
the one or the other partition and the individual
replicas will be, on average, not so similar to each other.
So, by plotting the similarity as a function of the resolution
parameter $\gamma$, stable communities are revealed by
peaks. Ronhovde and Nussinov adopted similarity measures
borrowed from information theory (Section XV.B).
Their criterion of stability can be adopted to determine
the relevance of partitions obtained with any multiresolution
algorithm.
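
The energy and the sequential single-vertex minimization can be sketched as follows. The code assumes the form of Eq. 80 in which every intra-community edge contributes -1 and every intra-community missing edge contributes $+\gamma$; it starts from singletons and only examines the communities of the neighbors of each vertex. Function names and the stopping rule (`max_sweeps`) are illustrative, and the replica-based stability analysis is not included.

```python
def potts_energy(adj, labels, gamma):
    """Energy of a labeling: -1 per internal edge, +gamma per internal non-edge."""
    vertices = list(adj)
    energy = 0.0
    for a in range(len(vertices)):
        for b in range(a + 1, len(vertices)):
            i, j = vertices[a], vertices[b]
            if labels[i] == labels[j]:
                energy += -1.0 if j in adj[i] else gamma
    return energy


def local_energy(v, members, adj, gamma):
    """Contribution of vertex v if it sits in the community `members`."""
    others = members - {v}
    links = len(others & adj[v])
    return -links + gamma * (len(others) - links)


def minimize(adj, gamma, max_sweeps=100):
    """Greedy minimization by moving single vertices to neighboring communities."""
    labels = {v: v for v in adj}
    communities = {v: {v} for v in adj}
    for _ in range(max_sweeps):
        moved = False
        for v in adj:
            old = labels[v]
            best, best_e = old, local_energy(v, communities[old], adj, gamma)
            for c in {labels[w] for w in adj[v]}:
                e = local_energy(v, communities[c], adj, gamma)
                if e < best_e:
                    best, best_e = c, e
            if best != old:
                communities[old].remove(v)
                if not communities[old]:
                    del communities[old]
                communities[best].add(v)
                labels[v] = best
                moved = True
        if not moved:
            break
    return labels


adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = minimize(adj, gamma=1.0)
print(labels, potts_energy(adj, labels, 1.0))
```

Running `minimize` several times with shuffled vertex orders and comparing the resulting labelings would give the replicas on which the stability criterion described above is based.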

A general problem of multiresolution methods is how
to assess the stability of partitions for large graphs. The
rapidly increasing number of partitions, obtained by minimal
shifts of vertices between clusters, introduces a large
amount of noise, that blurs signatures of stable partitions
like plateaus, spikes, etc. that one can observe in small
systems. In this respect, it seems far more reliable to focus
on correlations between partitions (like the average
similarity used by Ronhovde and Nussinov (Ronhovde
and Nussinov, 2008; Ronhovde and Nussinov, 2009)) than
on properties of the individual partitions (like the measures
of occurrence used by Arenas et al. (Arenas et al.,
2008b) and by Lancichinetti et al. (Lancichinetti et al.,
2009)).
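
For small systems, the plateau signature used by Arenas et al. is straightforward to read off a sweep. The sketch below assumes that their self-loop modularity has the form $Q_r = \sum_c [(2W_c + N_c r)/(2W + nr) - ((S_c + N_c r)/(2W + nr))^2]$, with $W_c$ the internal weight and $S_c$ the total strength of cluster c. The sweep data passed to `plateaus` are invented for illustration, since no optimizer is included, and all names are hypothetical.

```python
def self_loop_modularity(edges, partition, r):
    """Evaluate the assumed Q_r for a partition of a weighted graph.

    edges: dict {(i, j): weight}, each undirected edge listed once.
    """
    n = sum(len(cluster) for cluster in partition)
    total_weight = sum(edges.values())  # W
    norm = 2 * total_weight + n * r
    if norm <= 0:
        raise ValueError("r must be larger than -2W/n")
    q = 0.0
    for cluster in partition:
        internal = sum(w for (i, j), w in edges.items()
                       if i in cluster and j in cluster)          # W_c
        strength = sum(w for (i, j), w in edges.items()
                       for v in (i, j) if v in cluster)           # S_c
        size = len(cluster)                                       # N_c
        q += (2 * internal + size * r) / norm \
            - ((strength + size * r) / norm) ** 2
    return q


def plateaus(sweep):
    """Group a sorted sweep [(r, n_clusters), ...] into runs of equal n_clusters.

    Returns a list of (n_clusters, r_start, r_end).
    """
    runs = []
    start, count = sweep[0]
    previous = start
    for value, k in sweep[1:]:
        if k != count:
            runs.append((count, start, previous))
            start, count = value, k
        previous = value
    runs.append((count, start, previous))
    return runs


edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1,
         (3, 4): 1, (3, 5): 1, (4, 5): 1}
print(self_loop_modularity(edges, [{0, 1, 2}, {3, 4, 5}], r=0.0))

# Hypothetical outcome of a sweep: number of clusters found for each r.
sweep = [(-1.0, 1), (0.0, 2), (1.0, 2), (2.0, 2), (3.0, 6)]
longest = max(plateaus(sweep), key=lambda run: run[2] - run[1])
print(longest)
```

At r = 0 the function reduces to the standard weighted modularity, and for large r the partition in singletons scores highest, in line with the role of r as a resolution parameter.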



B. Hierarchical methods 

The natural procedure to detect the hierarchical struc- 
ture of a graph is hierarchical clustering, that we have 
discussed in Section IV. B. There we have emphasized 
the main weakness of the procedure, which consists of 
the necessity to introduce a criterion to identify relevant 
partitions (hierarchical levels) out of the full dendrogram 
produced by the given algorithm. Furthermore, there is 
no guarantee that the results indeed reflect the actual hi- 
erarchical structure of the graph, and that they are not 
mere artefacts of the algorithm itself. Scholars have just 
started to deal with these problems. 

Sales-Pardo et al. have proposed a top-down approach
(Sales-Pardo et al., 2007). Their method consists
of two steps: 1) measuring the similarity between
vertices; 2) deriving the hierarchical structure of the
graph from the similarity matrix. The similarity measure,
named node affinity, is based on Newman-Girvan
modularity. Basically the affinity between two vertices
is the frequency with which they coexist in the same
community in partitions corresponding to local optima
of modularity. The latter are configurations for which
modularity is stable, i.e. it cannot increase if one shifts
one vertex from one cluster to another or by merging or
splitting clusters. The set of these partitions is called
$\mathcal{P}_{max}$. Before proceeding with the next step, one verifies
whether the graph has a significant community structure
or not. This is done by calculating the z-score (Eq. 51)
for the average modularity of the partitions in $\mathcal{P}_{max}$ with
respect to the average modularity of partitions with lo- 
cal modularity optima of the equivalent ensemble of null 
model graphs, obtained as usual by randomly rewiring 
the edges of the original graph under the condition that 
the expected degree sequence is the same as the degree 
sequence of the graph. Large z-scores indicate meaningful
cluster structure: Sales-Pardo et al. used a threshold
corresponding to the 1% significance level. If the graph
has a relevant cluster structure, one proceeds with the
second step, which consists in putting the affinity matrix
in a form as close as possible to block-diagonal, by min- 
imizing a cost function expressing the average distance 
of connected vertices from the diagonal. The blocks cor- 
respond to the communities and the recovered partition 
represents the uppermost organization level. To deter- 
mine lower levels, one iterates the procedure for each 
subgraph identified at the previous level, which is treated 
as an independent graph. The procedure stops when all 
blocks found do not have a relevant cluster structure, i. e. 
their z-scores are lower than the threshold. The parti- 
tions delivered by the method are hierarchical by con- 
struction, as communities at each level are nested within 
communities at higher levels. However, the method may 
find no relevant partition (no community structure), a 
single partition (community structure but no hierarchy) 
or more (hierarchy) and in this respect it is better than 
most existing methods. The algorithm is not fast, as both
the search of local optima for modularity and the rearrangement
of the similarity matrix are performed with
simulated annealing, but delivers good results for computer
generated networks, and meaningful partitions for
some real networks, like the world airport network (Barrat
et al., 2004), an email exchange network of a Catalan
university (Guimerà et al., 2003), a network of electronic
circuits (Itzkovitz et al., 2005) and metabolic networks
of E. coli (Guimerà et al., 2007).
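
The node affinity of the first step is just a co-occurrence frequency, as the following sketch shows. It assumes that a set of partitions at local modularity optima has already been collected by some optimizer (the two partitions below are made up for illustration); the function name is hypothetical and neither the significance test nor the matrix reordering is implemented.

```python
from itertools import combinations


def node_affinity(partitions, vertices):
    """Fraction of partitions in which each pair of vertices shares a cluster."""
    counts = {pair: 0 for pair in combinations(sorted(vertices), 2)}
    for partition in partitions:
        for cluster in partition:
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return {pair: c / len(partitions) for pair, c in counts.items()}


# Two hypothetical locally optimal partitions of a six-vertex graph.
partitions = [
    [{0, 1, 2}, {3, 4, 5}],
    [{0, 1}, {2, 3, 4, 5}],
]
affinity = node_affinity(partitions, range(6))
print(affinity[(0, 1)], affinity[(0, 2)], affinity[(0, 3)])
```

Arranging these values in a matrix and reordering rows and columns so that large entries sit close to the diagonal is the block-diagonalization performed in the second step.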

Footnote (on the 1% significance level): We recall that the
significance of the z-score has to be computed
with respect to the actual distribution of the maximum
modularity for the null model graphs, as the latter is not Gaussian
(Section VI.C).

Footnote (on the use of simulated annealing): The reordering
of the matrix is by far the most time-consuming
part of the method. The situation improves if one adopts faster
optimization strategies than simulated annealing, at the cost of
less accurate results.

FIG. 26 Hierarchical random graphs by Clauset et
al. (Clauset et al., 2008). The picture shows two possible dendrograms
for the simple graph on the top. The linking probabilities
on the internal nodes of the dendrograms yield the
best fit of the model graphs to the graph at study. Reprinted
figure with permission from Ref. (Clauset et al., 2008). ©2008
by the Nature Publishing Group.

FIG. 27 Possible scenarios in the evolution of communities.
Reprinted figure with permission from Ref. (Palla et al.,
2007). ©2007 by the Nature Publishing Group.

Clauset et al. (Clauset et al., 2007; Clauset et al., 2008)
described the hierarchical organization of a graph by introducing
a class of hierarchical random graphs. A hierarchical
random graph is defined by a dendrogram $\mathcal{D}$,
which is the natural representation of the hierarchy, and
by a set of probabilities $\{p_r\}$ associated to the $n - 1$ internal
nodes of the dendrogram. An ancestor of a vertex i is
any internal node of the dendrogram that is encountered
by starting from the "leaf" vertex i and going all the way
up to the top of the dendrogram. The probability that
vertices i and j are linked to each other is given by the
probability $p_r$ of the lowest common ancestor of i and
j. Clauset et al. searched for the model $(\mathcal{D}, \{p_r\})$ that
best fits the observed graph topology, by using Bayesian
inference (Section IX.A). The probability that the model
fits the graph is proportional to the likelihood

$$\mathcal{L}(\mathcal{D}, \{p_r\}) = \prod_{r \in \mathcal{D}} p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}. \qquad (81)$$

Here, $E_r$ is the number of edges connecting vertices whose
lowest common ancestor is r, $L_r$ and $R_r$ are the numbers
of graph vertices in the left and right subtrees descending
from the dendrogram node r, and the product runs
over all internal dendrogram nodes. For a given dendrogram
$\mathcal{D}$, the maximum likelihood $\mathcal{L}(\mathcal{D})$ corresponds to
the set of probabilities $\{p_r\}$, where $p_r$ equals the actual
density of edges $E_r/(L_r R_r)$ between the two subtrees of
r (Fig. 26). One can define the statistical ensemble of
hierarchical random graphs describing a given graph $\mathcal{G}$,
by assigning to each model graph $(\mathcal{D}, \{p_r\})$ a probability
proportional to the maximum likelihood $\mathcal{L}(\mathcal{D})$. The ensemble
can be sampled by a Markov chain Monte Carlo
method (Newman and Barkema, 1999). The procedure 
suggested by Clauset et al. seems to converge to equilibrium
roughly in a time $O(n^2)$, although the actual complexity
may be much higher. Still, the authors were able
to investigate graphs with a few thousand vertices. From 
sufficiently large sets of model configurations sampled at 
equilibrium, one can compute average properties of the 
model, e. g. degree distributions, clustering coefficients, 
etc., and compare them with the corresponding properties
of the original graph. Tests on real graphs reveal that
the model is indeed capable of closely describing the graph
properties. Furthermore, the model enables one to predict
missing connections between vertices of the original
graph. This is a very important problem (Liben-Nowell
and Kleinberg, 2003): edges of real graphs are the result
of observations/experiments, that may fail to discover 
some relationships between the units of the system. From 
the ensemble of the hierarchical random graphs one can 
derive the average linking probability between all pairs 
of graph vertices. By ranking the probabilities corre- 
sponding to vertex pairs which are disconnected in the 
original graph, one may expect that the pairs with high- 
est probabilities are likely to be connected in the system, 
even if such connections are not observed. Clauset et al. 
pointed out that their method does not deliver a sharp 
hierarchical organization for a given graph, but a class of 
possible organizations, with well-defined probabilities. It 
is certainly reasonable to assume that many structures
are compatible with a given graph topology. In the case
of community structure, it is not clear which informa- 
tion one can extract from averaging over the ensemble of 
hierarchical random graphs. Moreover, since the hierar- 
chical structure is represented by a dendrogram, it is im- 
possible to rank partitions according to their relevance. 
In fact, the work by Clauset et al. questions the con- 
cept of "relevant partition" , and opens a debate in the 
scientific community about the meaning itself of graph 
clustering. The software of the method can be found at
http://www.santafe.edu/~aaronc/hierarchy/.
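
The maximum likelihood of a single dendrogram follows directly from Eq. 81 once each $p_r$ is set to the edge density $E_r/(L_r R_r)$ between the two subtrees of r. The sketch below evaluates its logarithm for a dendrogram written as nested pairs; this representation, the function names and the toy graph are choices made for the illustration, and the Markov chain Monte Carlo sampling over dendrograms is not included.

```python
import math


def leaves(node):
    """Set of graph vertices below a dendrogram node (leaves are non-tuples)."""
    if isinstance(node, tuple):
        return leaves(node[0]) | leaves(node[1])
    return {node}


def log_likelihood(node, edges):
    """Log of Eq. 81 with every p_r set to E_r / (L_r * R_r)."""
    if not isinstance(node, tuple):
        return 0.0
    left, right = leaves(node[0]), leaves(node[1])
    e_r = sum(1 for i, j in edges
              if (i in left and j in right) or (i in right and j in left))
    pairs = len(left) * len(right)
    p_r = e_r / pairs
    own = 0.0
    if 0.0 < p_r < 1.0:  # terms with p_r equal to 0 or 1 contribute zero
        own = e_r * math.log(p_r) + (pairs - e_r) * math.log(1.0 - p_r)
    return own + log_likelihood(node[0], edges) + log_likelihood(node[1], edges)


# Two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
by_triangle = (((0, 1), 2), ((3, 4), 5))
mixed = (((0, 3), 1), ((2, 4), 5))
print(log_likelihood(by_triangle, edges), log_likelihood(mixed, edges))
```

The dendrogram that separates the two triangles at the root has the larger likelihood, so it would carry more weight in the ensemble of hierarchical random graphs.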



XIII. DETECTION OF DYNAMIC COMMUNITIES 

The analysis of dynamic communities is still in its in- 
fancy. Studies in this direction have been mostly hin- 
dered by the fact that the problem of graph clustering 
is already controversial on single graph realizations, so 
it is understandable that most efforts still concentrate 
on the "static" version of the problem. Another diffi- 
culty is represented by the dearth of timestamped data 
on real graphs. Recently, several data sets have become
available, allowing one to monitor the evolution in time of
real systems (Kumar et al., 2003, 2006; Leskovec et al., 
2008, 2005). So it has become possible to investigate how 
communities form, evolve and die. The main phenomena 
occurring in the lifetime of a community are (Fig. 27): 
birth, growth, contraction, merger with other communi- 
ties, split, death. 

The first study was carried out by Hopcroft et 
al. (Hopcroft et al., 2004), who analyzed several snap- 
shots of the citation graph induced by the NEC CiteSeer 
Database (Giles et al., 1998). The snapshots cover the 
period from 1990 to 2001. Communities are detected 
by means of (agglomerative) hierarchical clustering (Section
IV.B), where the similarity between vertices is the
cosine similarity of the vectors describing the corresponding
papers, a well-known measure used in information
retrieval (Baeza-Yates and Ribeiro-Neto, 1999). In each
snapshot Hopcroft et al. identified the natural communi- 
ties, defined as those communities of the hierarchical tree 
that are only slightly affected by minor perturbations of 
the graph, where the perturbation consists in removing 
a small fraction of the vertices (and their edges). Such 
natural communities are conceptually similar to the sta- 
ble communities we will see in Section XIV. Hopcroft et 
al. found the best matching natural communities across 
different snapshots, and in this way they could follow the 
history of communities. In particular they could see the 
emergence of new communities, corresponding to new research
topics. The main drawback of the method comes
from the use of hierarchical clustering, which cannot
single out the meaningful communities from the hierarchical
tree, which includes many different partitions of the
graph.

More recently, Palla et al. performed a systematic 
analysis of dynamic communities (Palla et al., 2007). 
They studied two social systems: 1) a graph of phone 
calls between customers of a mobile phone company in 
a year's time; 2) a collaboration network between scien- 
tists, describing the coauthorship of papers in condensed 
matter physics from the electronic e-print archive (cond- 
mat) maintained by Cornell University Library, spanning 
a period of 142 months. The first problem is identifying
the image of a community C(t+1) at time t+1 among the
communities of the graph at time t. A simple criterion,
used in other works, is to measure the relative overlap
(Eq. 97) of C(t+1) with all communities at time t, and
pick the community which has the largest overlap with
C(t+1). This is intuitive, but in many cases it may miss
the actual evolution of the community. For instance, if
C(t) grows considerably at time t+1 and overlaps with
another community B(t+1) (which at the previous time
step was disjoint from C(t)), the relative overlap between
C(t+1) and B(t) may be larger than the relative overlap
between C(t+1) and C(t). It is not clear whether there
is a general prescription to avoid this problem. Palla et
al. solved it by exploiting the features of the Clique Percolation
Method (CPM) (Section XI.A), which they used
to detect communities. The idea is to analyze the graph
G(t, t+1), obtained by merging the two snapshots G(t)
and G(t+1) of the evolving graph, at times t and t+1 (i.e.,
by putting together all their vertices and edges). Any
CPM community of G(t) and G(t+1) does not get lost,
as it is included within one of the CPM communities of
G(t, t+1). For each CPM community V_k of G(t, t+1),
one finds the CPM communities {C_k^t} and {C_k^{t+1}} (of G(t)
and G(t+1), respectively) which are contained in V_k. The
image of any community in {C_k^{t+1}} at time t is the community
of {C_k^t} that has the largest relative overlap with
it.
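
The two matching rules discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: communities are given as sets of vertex labels, the communities of the merged graph are supplied as input rather than computed with the CPM, and the helper names (`relative_overlap`, `best_match`, `images_within_union`) are hypothetical.

```python
def relative_overlap(a, b):
    """Relative overlap |a & b| / |a | b| of two vertex sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(community, candidates):
    """Candidate with the largest relative overlap with `community`."""
    return max(candidates, key=lambda c: relative_overlap(community, c))

def images_within_union(union_comms, comms_t, comms_t1):
    """For each community at t+1, look for its image at time t only among
    the communities at t contained in the same community V_k of the merged
    graph (the V_k are assumed to be given)."""
    images = {}
    for v_k in union_comms:
        old = [c for c in comms_t if c <= v_k]
        new = [c for c in comms_t1 if c <= v_k]
        for c_new in new:
            images[c_new] = best_match(c_new, old) if old else None
    return images

# Pitfall of the naive criterion: C grows and absorbs most of a larger,
# previously disjoint community B, so C(t+1) overlaps more with B(t).
C_t = frozenset({1, 2, 3})
B_t = frozenset(range(4, 14))      # vertices 4..13
C_t1 = frozenset(range(1, 13))     # vertices 1..12
naive_image = best_match(C_t1, [C_t, B_t])

# Matching restricted to the communities of the merged graph.
union_comms = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8, 9})]
comms_t = [frozenset({1, 2, 3}), frozenset({5, 6, 7})]
comms_t1 = [frozenset({1, 2, 3, 4}), frozenset({6, 7, 8, 9})]
images = images_within_union(union_comms, comms_t, comms_t1)
```

In the first example the relative overlap of C(t+1) with C(t) is 3/12, while with B(t) it is 9/13, so the naive criterion picks B(t) as the image.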

The age τ of a community is the time since its birth.
It turns out that the age of a community is positively
correlated with its size s(t), i.e. older communities
are also larger (on average). The time evolution of a
community C can be described by means of the relative
overlap C(t) between states of the community separated
by a time t:

$$ C(t) = \frac{|C(t_0) \cap C(t_0+t)|}{|C(t_0) \cup C(t_0+t)|}. \tag{82} $$

One finds that, in both data sets, C(t) decays faster for
larger communities, so the composition of large communities
is rather variable in time, whereas small communities
are essentially static. Another important question
is whether it is possible to predict the evolution of com- 
munities from information on their structure or on their 
vertices. In Fig. 28a the probability p_l that a vertex will
leave the community in the next step of the evolution is 
plotted as a function of the relative external strength of 
the vertex, indicating how much of the vertex strength 
lies on edges connecting it to vertices outside its com- 
munity. The plot indicates that there is a clear positive 
correlation: vertices which are only loosely connected to 
vertices of their community have a higher chance (on av- 
erage) to leave the community than vertices which are 
more "committed" towards the other community mem- 
bers. The same principle holds at the community level 
too. Fig. 28b shows that the probability p_d that a community
will disintegrate in the next time step is positively
correlated with the relative external strength of
the community. Finally, Palla et al. have found that the 
probability for two communities to merge increases with 
the community sizes much more than what one expects 
from the size distribution, which is consistent with the 
faster dynamics observed for large communities. Palla et 
al. analyzed two different real systems, a network of mo- 
bile phone communications and a coauthorship network, 
to be able to infer general properties of community evo- 
lution. However, communities were only found with the 
CPM, so their results need to be cross-checked by em- 
ploying other clustering techniques. 

FIG. 28 Relation between structural features and evolution
of a community. a) Relation between the probability that a
vertex will abandon the community in the next time step and
its relative external strength. b) Relation between the probability
of disintegration of a community in the next time step
and its relative external strength. Reprinted figure with permission
from Ref. (Palla et al., 2007). ©2007 by the Nature
Publishing Group.

Asur et al. (Asur et al., 2007) explored the dynamic
relationship between vertices and communities. Communities
were found with the MCL method by Van Dongen
(Dongen, 2000a) (Section VIII.B), by analyzing the
graph at different timestamps. Asur et al. distinguished
events involving communities and events involving the
vertices. Events involving communities are Continue (the
community keeps most of its vertices in consecutive time
steps), κ-Merge (two clusters merge into another), κ-Split
(a cluster splits in two parts), Form (no pair of vertices
of the cluster at time t+1 were in the same cluster at
time t) and Dissolve (opposite of Form). Events involving
vertices are Appear (if a vertex joins the graph for the
first time), Disappear (if a vertex of the graph at time t
is no longer there at time t+1), Join (if a vertex of a
cluster at time t+1 was not in that cluster at time t)
and Leave (if a vertex which was in a cluster at time t is
not in that cluster at time t+1). Based on such events,
four measures are defined in order to capture the behavioral
tendencies of vertices contributing to the evolution
of the graph: the stability index (measuring the tendency
of a vertex to interact with the same vertices over time),
the sociability index (measuring the number of different
interactions of a vertex, basically the number of Join and
Leave events), the popularity index (measuring the number
of vertices attracted by a cluster in a given time interval)
and the influence index (measuring the influence
a vertex has on the others, which is computed from the
number of vertices that leave or join a cluster together
with the vertex). Applications to a coauthorship network
of computer scientists and to a network of subjects for
clinical trials show that the behavioral measures above
enable one to make reliable predictions about the time
evolution of such graphs (including, e.g., the inference
of missing links (Liben-Nowell and Kleinberg, 2003)).
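
The vertex events listed above can be detected from two consecutive snapshots as in the following sketch. It assumes that cluster labels have already been matched across the two time steps, which is a separate problem, and it treats a vertex as appearing "for the first time" if it is absent from the previous snapshot and from an optional set of previously seen vertices. The function name `vertex_events` and the data layout are hypothetical, not taken from Asur et al.

```python
def vertex_events(prev, curr, seen_before=None):
    """Detect Appear, Disappear, Join and Leave events between times t and t+1.

    prev, curr: dicts mapping vertex -> cluster label at t and t+1
    (labels are assumed to be already matched across snapshots).
    seen_before: optional set of vertices observed at any earlier time.
    """
    seen = set(prev) | (set(seen_before) if seen_before else set())
    events = {"Appear": set(), "Disappear": set(), "Join": set(), "Leave": set()}
    for v, label in curr.items():
        if v not in seen:
            events["Appear"].add(v)
        if prev.get(v) != label:          # v was not in this cluster at time t
            events["Join"].add((v, label))
    for v, label in prev.items():
        if v not in curr:
            events["Disappear"].add(v)
        if curr.get(v) != label:          # v is no longer in this cluster at t+1
            events["Leave"].add((v, label))
    return events

prev = {"a": 1, "b": 1, "c": 2, "d": 2}
curr = {"a": 1, "b": 2, "c": 2, "e": 2}
events = vertex_events(prev, curr)
```

Counting the Join and Leave events of a vertex over many time steps gives the raw ingredient of a sociability-like index; the exact normalizations used by Asur et al. are not reproduced here.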

Dynamic communities can also be detected with
methods of information compression, such as some of
those we have seen in Section IX.B. Sun et al. (Sun
et al., 2007) applied the Minimum Description Length
(MDL) principle (Grünwald et al., 2005; Rissanen, 1978)
to find the minimum encoding cost for the description
of a time sequence of graphs and their partitions in
communities. The method is quite similar to that subsequently
developed by Rosvall and Bergstrom (Rosvall
and Bergstrom, 2007), which is however defined only for
static graphs (Section IX.B). Here one considers bipartite
graphs evolving in time. The time sequence of graphs can 
be separated in segments, each containing some number 
of consecutive snapshots of the system. The graphs of 
each segment are supposed to have the same modular 
structure (i. e. they represent the same phase in the 
history of the system), so they are characterized by the 
same partition of the two vertex classes. For each graph 
segment it is possible to define an encoding cost, which 
combines the encoding cost of the partition of the graphs 
of the segment with the entropy of compression of the seg- 
ment in the subgraph segments induced by the partition. 
The total encoding cost C of the graph series is given 
by the sum of the encoding costs of its segments. Minimizing
C enables one to find not only the most modular
partition for each graph segment (high modularity corresponds
to low encoding costs for a partition), but also
the most compact subdivision of the snapshots into segments,
such that graphs in the same segment are strongly
correlated with each other. The latter feature allows one to
identify change points in the time history of the system,
i. e. short periods in which the dynamics produces big 
changes in the graph structure (corresponding to, e.g., 
extreme events). The minimization of C is NP-hard, 
so the authors propose an approximation method called 
GraphScope, which consists of two steps: first, one looks 
for the best partition of each graph segment; second, one 
looks for the best division in segments. In both cases 
the "best" result corresponds to the minimal encoding 
cost. The best partition within a graph segment is found 
by local search. GraphScope has the big advantage of not
requiring any input, such as the number and sizes of the
clusters. It is also suitable to operate in a streaming environment,
in which new graph configurations are added
in time, following the evolution of the system: the com- 
putational complexity required to process a snapshot (on 
average) is stable over time. Tests on real evolving data 
sets show that GraphScope is able to find meaningful 
communities and change points. 

We stress that here by modularity we mean the feature of a graph
having community structure, not the modularity of Newman and
Girvan.

Since keeping track of communities in different time
steps is not a trivial problem, as we have seen above, it
is perhaps easier to adopt a vertex-centric perspective,
in which one monitors the community of a given vertex
at different times. For any method, given a vertex i and
a time t, the community to which i belongs at time t is
well defined. Fenn et al. (Fenn et al., 2009) used the multiresolution
method by Reichardt et al. (Reichardt and
Bornholdt, 2006a) (Section VI.B) and investigated a fully
connected graph with time-dependent weights, representing
the correlations of time series of hourly exchange rate
returns. The resolution parameter γ is fixed to the value
that occurs in most stability plateaus of the system at
different time steps. Motivated by the work of Guimera
and Amaral (Guimera and Amaral, 2005) (Section XVI),
Fenn et al. identify the role of individual vertices in their
community through the pair (z_i^in, z_i^b), where z_i^in is the
z-score of the internal strength (weighted degree, Section
A.1), defined in Eq. 98, and z_i^b the z-score of the
site betweenness, defined by replacing the internal degree
with the site betweenness of Freeman (Freeman, 1977) in
Eq. 98. We recall that the site betweenness is a measure
of the number of shortest paths running through a vertex.
The variable z_i^b expresses the importance of a vertex
in processes of information diffusion with respect to the
other members of its community. Another important issue
regards the persistence of communities in time, i.e.
how stable they are during the evolution. As a measure
of persistence, Fenn et al. introduced a vertex-centric
version of the relative overlap of Eq. 82

$$ a_i^t(\tau) = \frac{|C_i(t) \cap C_i(t+\tau)|}{|C_i(t) \cup C_i(t+\tau)|}, \tag{83} $$

where i is the vertex and C_i(t), C_i(t+τ) the communities
of i at times t, t+τ, respectively. The decay of a_i^t(τ)
depends on the type of vertex. In particular, if the vertex
is strongly connected to its community (z_i^in large),
a_i^t(τ) decays quite slowly, meaning that it tends to stay
attached to a stable core of vertices.
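
A vertex-centric relative overlap of this kind is straightforward to compute once the community of each vertex is known at every time step. The sketch below assumes that each snapshot is a dictionary mapping vertices to community labels, with labels meaningful only within a snapshot; the names `community_of` and `vertex_autocorrelation` are hypothetical.

```python
def community_of(snapshot, i):
    """Set of vertices sharing the community label of vertex i."""
    return {v for v, label in snapshot.items() if label == snapshot[i]}

def vertex_autocorrelation(snapshots, i, t, tau):
    """Relative overlap between the communities of vertex i at times t and t+tau."""
    c_now = community_of(snapshots[t], i)
    c_later = community_of(snapshots[t + tau], i)
    return len(c_now & c_later) / len(c_now | c_later)

# Three toy snapshots: the community of "a" shrinks over time.
snapshots = [
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 1, "c": 1, "d": 1},
]
decay = [vertex_autocorrelation(snapshots, "a", 0, tau) for tau in (0, 1, 2)]
```

Here `decay` holds the values for τ = 0, 1, 2, namely 1, 2/3 and 1/3; since both communities contain vertex i, the union is never empty.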

The methods described above are basically two-stage
approaches, in which clusters are detected at each timestamp
of the graph evolution, independently of the results
at different times, and relationships between partitions
at different times are inferred afterwards. However, this
often produces significant variations between partitions
close in time, especially when, as usually happens, the
datasets are noisy. In this case one would be forced to
introduce ad hoc hypotheses on the graph evolution to
justify the variability of the modular structure, whereas
such variability is mostly an artefact of the approach.
It would be desirable instead to have a unified framework,
in which clusters are deduced both from the current
structure of the graph and from the knowledge of
the cluster structure at previous times. This is the principle
of evolutionary clustering, a framework introduced
by Chakrabarti et al. (Chakrabarti et al., 2006) and subsequently
adopted and refined by other authors. Let C_t
be the partition of the graph at time t. The snapshot
quality of C_t measures the goodness of the partition with
respect to the graph structure at time t. The history
cost is a measure of the distance/dissimilarity of C_t with
respect to the partition C_{t-1} at the previous time step.
The overall quality of C_t is given by a combination of the
snapshot quality and the history cost at each time step.
Ideally, a good partition should have high snapshot quality
(i.e. it should cluster well the data at time t) and low
history cost (i.e. it should be similar to the partition
at the previous time step). In order to find C_t from C_{t-1}
and the relational data at time t, Chakrabarti et al. suggested
to maximize the difference between the snapshot
quality and the history cost, with a relative weight cp
that is a tunable parameter. The input of the procedure
consists of the sequence of adjacency/similarity matrices
at different time steps. In practice, one could use modified
versions of such matrices, obtained by performing
(weighted) averages of the data over some time window,
in order to make the relational data more robust against
noise and the results of the clustering procedure more reliable.
One can adopt arbitrary measures to compute the
snapshot quality and the history cost. Besides, several
known clustering techniques used for static graphs
can be reformulated within this evolutionary framework.
Chakrabarti et al. derived evolutionary versions of hierarchical
clustering (Section IV.B) and k-means clustering
(Section IV.C), whereas Chi et al. (Chi et al.,
2007) designed two implementations for spectral clustering
(Section IV.D). Based on evolutionary clustering, Lin
et al. (Lin et al., 2008) introduced a framework, called
FacetNet, that allows vertices to belong to more than one
community at the same time. Here the snapshot cost is the
Kullback-Leibler (KL) divergence (Kullback and Leibler,
1951) between the adjacency/similarity matrix at time
t and the matrix describing the community structure of
the graph at time t; the history cost is the KL divergence
between the matrices describing the community structure
of the graph at times t-1 and t. FacetNet can be extended
to handle adding and removing of vertices as well
as variations of the number of clusters in consecutive time
steps. However, it is not able to account for the creation
and the disintegration of communities and is not scalable to
large systems, due to the high number of iterations necessary
for the matrix computations to reach convergence.
These issues have been addressed in a recent approach
by Kim and Han (Kim and Han, 2009).
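
The trade-off between snapshot quality and history cost can be illustrated with a small sketch. Since arbitrary measures may be adopted, the choices below are assumptions made only for illustration: the snapshot quality is the Newman-Girvan modularity, the history cost is the fraction of vertex pairs on which two partitions disagree, and the partition at time t is picked from a short list of candidates rather than found by optimization. All function names are hypothetical.

```python
from itertools import combinations

def modularity(partition, edges):
    """Newman-Girvan modularity of a partition (list of vertex sets)."""
    m = len(edges)
    label = {v: k for k, cluster in enumerate(partition) for v in cluster}
    inside = [0] * len(partition)   # edges inside each cluster
    degree = [0] * len(partition)   # total degree of each cluster
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(inside, degree))

def history_cost(partition, previous):
    """Fraction of vertex pairs on which the two partitions disagree.
    Both partitions are assumed to cover the same vertex set."""
    a = {v: k for k, c in enumerate(partition) for v in c}
    b = {v: k for k, c in enumerate(previous) for v in c}
    pairs = list(combinations(sorted(a), 2))
    return sum((a[u] == a[v]) != (b[u] == b[v]) for u, v in pairs) / len(pairs)

def choose_partition(candidates, edges, previous, cp):
    """Candidate maximizing snapshot quality minus cp times history cost."""
    return max(candidates,
               key=lambda p: modularity(p, edges) - cp * history_cost(p, previous))

edges = [(0, 1), (1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5), (3, 5)]
previous = [{0, 1, 2}, {3, 4, 5}]                 # partition at time t-1
candidates = [[{0, 1, 2}, {3, 4, 5}], [{0, 1}, {2, 3, 4, 5}]]
static_choice = choose_partition(candidates, edges, previous, cp=0.0)
smooth_choice = choose_partition(candidates, edges, previous, cp=0.5)
```

With cp = 0 the second candidate wins on modularity alone; with cp = 0.5 the first candidate is preferred, because moving vertex 2 changes five of the fifteen vertex pairs with respect to the previous partition.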

Lin et al. used the cost and not the quality to evaluate the fit
of the partition to the data. The two estimates are obviously
related: the lower the cost, the higher the quality.

Naturally, what one hopes to achieve at the end of the
day is to see how real groups form and evolve in time.
Backstrom et al. (Backstrom et al., 2006) have carried out
an analysis of group dynamics in the free online community
of LiveJournal (http://www.livejournal.com/)
and in a coauthorship network of computer scientists.
Here the groups are identified through the declared memberships
of users (for LiveJournal) and conferences attended
by computer scientists, respectively. Backstrom
and coworkers have found that the probability that an
individual joins a community grows with the number of
friends/coauthors who are already in the community and
(for LiveJournal) with their degree of interconnectedness.
Moreover, the probability of growth of LiveJournal communities
is positively correlated to a combination of factors
including the community size, the number of friends
of community members which are not in the community
and the ratio of these two numbers. A high density of
triads within a community appears instead to hinder its
growth.



XIV. SIGNIFICANCE OF CLUSTERING 

Given a network, many partitions could represent 
meaningful clusterings in some sense, and it could be dif- 
ficult for some methods to discriminate between them. 
Quality functions evaluate the goodness of a partition 
(Section III.C.2), so one could say that high quality cor- 
responds to meaningful partitions. But this is not nec- 
essarily true. In Section VI. C we have seen that high 
values of the modularity of Newman and Girvan do not 
necessarily indicate that a graph has a definite cluster 
structure. In particular we have seen that partitions of 
random graphs may also achieve considerably large val- 
ues of Q, although we do not expect them to have commu- 
nity structure, due to the lack of correlations between the 
linking probabilities of the vertices. The optimization of 
quality functions, like modularity, delivers the best par- 
tition according to the criterion underlying the quality 
function. But is the optimal clustering also significant, 
i.e. a relevant feature of the graph, or is it just a byproduct
of randomness and basic structural properties like,
e.g., the degree sequence? Little effort has been devoted
to this crucial issue, which we discuss here.

In some works the concept of significance has been 
related to that of robustness or stability of a partition 
against random perturbations of the graph structure. 
The basic idea is that, if a partition is significant, it will 
be recovered even if the structure of the graph is mod- 
ified, as long as the modification is not too extensive. 
Instead, if a partition is not significant, one expects that 
minimal modifications of the graph will suffice to disrupt 
the partition, so other clusterings are recovered. A nice
feature of this approach is the fact that it can be applied
to any clustering technique. Gfeller et al. (Gfeller et al.,
2005) considered the general case of weighted graphs. A
graph is modified, in that its edge weights are increased
or decreased by a relative amount 0 < σ < 1. This
choice also allows one to account for the possible effects of
uncertainties in the values of the edge weights, resulting
from measurements/experiments carried out on a given
system. After fixing σ (usually to 0.5), multiple realizations
of the original graph are generated. The best partition
for each realization is identified and, for each pair
of adjacent vertices i and j, the in-cluster probability p_ij
is computed, i.e. the fraction of realizations in which
i and j were classified in the same cluster. Edges with
in-cluster probability smaller than a threshold θ (usually
0.8) are called external edges. The stability of a partition
is estimated through the clustering entropy

$$ S = -\frac{1}{m} \sum_{(i,j):\,A_{ij}=1} \left[ p_{ij}\log_2 p_{ij} + (1-p_{ij})\log_2(1-p_{ij}) \right], \tag{84} $$

where m is, as usual, the number of graph edges, and
the sum runs over all edges. The most stable partition
has p_ij = 0 along inter-cluster edges and p_ij = 1 along
intra-cluster edges, which yields S = 0; the most unstable
partition has p_ij = 1/2 on all edges, yielding S = 1. The
absolute value of S is not meaningful, though, and needs
to be compared with the corresponding value for a null
model graph, similar to the original graph, but with supposedly
no cluster structure. Gfeller et al. adopted the
same null model of Newman-Girvan modularity, i.e. the
class of graphs with expected degree sequence coinciding
with that of the original graph. Since the null model is
defined on unweighted graphs, the significance of S can
be assessed only in this case, although it would not be
hard to think of a generalization to weighted graphs. The
approach also enables one to identify unstable vertices,
i.e. vertices lying at the boundary between clusters. In
order to do that, the external edges are removed and
the connected components of the resulting disconnected
graph are associated with the clusters detected in the
original graph, based on their relative overlap (computed
through Eq. 97). Unstable vertices end up in components
that are not associated to any of the initial clusters. A
weakness of the method by Gfeller et al. is represented by
the two parameters σ and θ, whose values are in principle
arbitrary.
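
The in-cluster probabilities and the clustering entropy can be computed as in the sketch below. It assumes that the partitions of the perturbed realizations are already available as dictionaries mapping vertices to cluster labels, so neither the perturbation of the weights nor the clustering step is shown; the function names are hypothetical.

```python
import math

def in_cluster_probabilities(edges, realizations):
    """Fraction of realizations in which the endpoints of each edge
    are classified in the same cluster."""
    n = len(realizations)
    return {(u, v): sum(part[u] == part[v] for part in realizations) / n
            for u, v in edges}

def clustering_entropy(probabilities):
    """Clustering entropy S: average binary entropy (in bits) of the
    in-cluster probabilities over all edges."""
    def binary_entropy(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(binary_entropy(p) for p in probabilities.values()) / len(probabilities)

def external_edges(probabilities, theta=0.8):
    """Edges whose in-cluster probability is below the threshold theta."""
    return {e for e, p in probabilities.items() if p < theta}

edges = [(0, 1), (1, 2), (2, 3)]
realizations = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 0, 1: 0, 2: 0, 3: 1},
    {0: 0, 1: 0, 2: 1, 3: 1},
]
probs = in_cluster_probabilities(edges, realizations)
S = clustering_entropy(probs)
external = external_edges(probs)
```

The limiting cases are easy to check: probabilities equal to 0 or 1 on every edge give S = 0, while 1/2 on every edge gives S = 1.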

More recently, Karrer et al. (Karrer et al., 2008)
applied a similar strategy to unweighted graphs. Here
one performs a sweep over all edges: the perturbation
consists in removing each edge with a probability α and
replacing it with another edge between a pair of vertices
(i, j), chosen at random with probability p_ij = k_i k_j / 2m,
where k_i and k_j are the degrees of i and j. We recognize
the probability of the null model of Newman-Girvan
modularity. Indeed, by varying the probability α from
0 to 1 one smoothly interpolates between the original
graph (no perturbation) and the null model (maximal
perturbation). The degree sequence of the graph remains
invariant (on average) along the whole process, by construction.
The idea is that the perturbation affects solely
the organization of the vertices, keeping the basic structural
properties. For a given value of α, many realizations
of the perturbed graph are generated, their cluster
structures are identified with some method (Karrer et al.
used modularity optimization) and compared with the
partition obtained from the original unperturbed graph.
The partitions are compared by computing the variation
of information V (Section XV.B). From the plot of the
average ⟨V⟩ versus α one can assess the stability of the
cluster structure of the graph. If ⟨V(α)⟩ changes rapidly
for small values of α the partition is likely to be unstable.
As in the approach by Gfeller et al., the behaviour of
the function ⟨V(α)⟩ does not have an absolute meaning,
but needs to be compared with the corresponding curve
obtained for a null model. For consistency, the natural
choice is again the null model of modularity, which is already
used in the process of graph perturbation. The approaches
by Gfeller et al. and Karrer et al., with suitable
modifications, can also be used to check for the stability
of the cluster structure in parts of a graph, up to the level
of individual communities. This is potentially important
as it may happen that some parts of the graph display a
strong community structure and other parts weak or no
community structure at all.
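
The perturbation and the comparison of partitions can be sketched as follows. The endpoints of each replacement edge are drawn independently with probability proportional to the vertex degrees, which selects a pair with probability proportional to k_i k_j; as a simplification, self-loops and multiple edges are not excluded. The clustering step is omitted, and the function names `perturb` and `variation_of_information` are hypothetical.

```python
import math
import random
from collections import Counter

def perturb(edges, alpha, rng):
    """Replace each edge, with probability alpha, by an edge whose endpoints
    are drawn with probability proportional to the vertex degrees."""
    stubs = [v for e in edges for v in e]   # vertex i appears k_i times
    perturbed = []
    for u, v in edges:
        if rng.random() < alpha:
            perturbed.append((rng.choice(stubs), rng.choice(stubs)))
        else:
            perturbed.append((u, v))
    return perturbed

def variation_of_information(a, b):
    """Variation of information (in bits) between two partitions given as
    dicts vertex -> label over the same vertex set."""
    n = len(a)
    size_a = Counter(a.values())
    size_b = Counter(b.values())
    joint = Counter((a[v], b[v]) for v in a)
    vi = 0.0
    for (x, y), n_xy in joint.items():
        vi -= (n_xy / n) * (math.log2(n_xy / size_a[x]) + math.log2(n_xy / size_b[y]))
    return vi

rng = random.Random(0)
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
unchanged = perturb(edges, 0.0, rng)    # alpha = 0: original graph
randomized = perturb(edges, 1.0, rng)   # alpha = 1: null model

part_a = {0: 0, 1: 0, 2: 1, 3: 1}
part_b = {0: 0, 1: 1, 2: 0, 3: 1}
vi_same = variation_of_information(part_a, part_a)
vi_diff = variation_of_information(part_a, part_b)
```

Repeating `perturb` for several values of α, clustering each realization and averaging the variation of information with respect to the unperturbed partition gives the curve ⟨V(α)⟩ discussed above.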

Rosvall and Bergstrom (Rosvall and Bergstrom, 2008) 
defined the significance of clusters with the bootstrap 
method (Efron and Tibshirani, 1993), which is a stan- 
dard procedure to check for the accuracy of a measure- 
ment/estimate based on resampling from the empirical 
data. The graph under study is supposed to be generated by
a parametric model, which is used to create many samples.
This is done by assigning to each edge a weight
drawn from a Poisson distribution with mean equal to the
original edge weight. For the initial graph and each sample
one identifies the community structure with some
method, which can be arbitrary. For each cluster of the
partition of the original graph one determines the largest 
subset of vertices that are classified in the same clus- 
ter in at least 95% of all bootstrap samples. Identifying 
such cluster cores enables one to track the evolution in 
time of the community structure, as we explained in Sec- 
tion XIII. 
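
The resampling step of the bootstrap can be sketched with the standard library alone. The Poisson sampler below uses Knuth's multiplication method, which is adequate only for small means; dropping edges whose resampled weight is zero is an assumption of this sketch, and the function names are hypothetical. The clustering of each sample and the search for the cluster cores are not shown.

```python
import math
import random

def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method
    (suitable for small means only)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def bootstrap_graphs(weights, n_samples, rng):
    """Resample every edge weight from a Poisson distribution whose mean is
    the original weight; edges resampled to zero are dropped (assumption)."""
    samples = []
    for _ in range(n_samples):
        sample = {}
        for edge, w in weights.items():
            new_w = poisson(w, rng)
            if new_w > 0:
                sample[edge] = new_w
        samples.append(sample)
    return samples

rng = random.Random(42)
weights = {("a", "b"): 5, ("b", "c"): 3, ("c", "d"): 1}
samples = bootstrap_graphs(weights, 100, rng)
```

Each element of `samples` is a weighted graph to be clustered with the same method used for the original graph.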

A different approach has been proposed by Massen
and Doye (Massen and Doye, 2006). They analyzed an
equilibrium canonical ensemble of partitions, with -Q
playing the role of the energy, Q being Newman-Girvan
modularity. This means that the probability of occurrence
of a partition at temperature T is proportional to
exp(Q/T). The idea is that, if a graph has a significant
cluster structure, at low temperatures one would
recover essentially the same partition, corresponding to
the modularity maximum, which is separated by an appreciable
gap from the modularity values of the other
partitions. On the contrary, graphs with no community
structure, e.g. random graphs, have many competing
(local) maxima, and the corresponding configurations
will emerge already at low temperatures, since their
modularity values are close to the absolute maximum.
These distinct behaviors can manifest themselves in various
ways. For instance, if one considers the variation of
the specific heat C = -dQ/dT with T, the gap in the
modularity landscape is associated to a sharp peak of
C around some temperature value, as happens in a
phase transition. If the gap is small and there are many
partitions with similar modularity values, the peak of C
becomes broad. Another strategy to assess the significance
of the maximum modularity partition consists of
the investigation of the similarity between partitions recovered
at a given temperature T. This similarity can
be expressed by the frequency matrix, whose element f_ij
indicates the relative number of times vertices i and j
have been classified in the same cluster. If the graph
has a clear community structure, at low temperatures
the frequency matrix can be put in block-diagonal form,
with the blocks corresponding to the communities of the
best partition; if there is no significant community structure,
the frequency matrix is rather homogeneous. The
Fiedler eigenvalue (Fiedler, 1973) λ_2, the second smallest
eigenvalue of the Laplacian matrix associated to the frequency
matrix, allows one to estimate how "block-diagonal"
the matrix is (see Section IV.A). At low temperatures
λ_2 ≈ 0 if there is one (a few) partitions with maximum
or near to maximum modularity; if there are many (almost)
degenerate partitions, λ_2 is appreciably different
from zero even when T → 0. A sharp transition from
low to high values of λ_2 by varying temperature indicates
significant community structure. Another clear signature
of significant community structure is the observation of
a rapid drop of the average community size with T, as
"strong" communities break up in many small pieces for
a modest temperature increase, while the disintegration
of "weak" communities takes place more slowly. In scale-free
graphs (Section A.3) clusters are often not well separated,
due to the presence of the hubs; in these cases
the above-mentioned transitions of ensemble variables are
not so sharp and take place over a broader temperature
range. The canonical ensemble of partitions is generated
through single spin heatbath simulated annealing (Reichardt
and Bornholdt, 2006a), combined with parallel
tempering (Earl and Deem, 2005). The approach by
Massen and Doye could be useful to recognize graphs
without cluster structure, if the modularity landscape is
characterized by many maxima with close values (but see
the footnote below). However, it can happen that gaps between the
absolute modularity maximum and the rest of the modularity
values are created by fluctuations, and the method
is unable to identify these situations. Furthermore, the
approach heavily relies on modularity and on a costly
technique like simulated annealing: extensions to other
quality functions and/or optimization procedures do not
appear straightforward.

As we have seen in Section VI.C, Good et al. (Good et al., 2009)
have actually shown that the modularity landscape has a huge
degeneracy of states with high modularity values, close to the
global maximum, especially on graphs with community structure.
So the results of the method by Massen and Doye may be
misleading.
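
The role of the gap in the modularity landscape can be seen from the canonical weights alone. The sketch below assumes that a small set of partitions and their modularity values are given explicitly, instead of being sampled by simulated annealing, and all the numbers are made up for illustration; the function names are hypothetical.

```python
import math

def canonical_probabilities(q_values, temperature):
    """Probabilities proportional to exp(Q/T), computed relative to the
    maximum of Q to avoid numerical overflow at low temperature."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def cocluster_frequency(partitions, probabilities, i, j):
    """Element f_ij of the frequency matrix: probability that vertices i and j
    are in the same cluster, for partitions given as dicts vertex -> label."""
    return sum(p for part, p in zip(partitions, probabilities)
               if part[i] == part[j])

# Made-up modularity values: one landscape with a gap, one nearly degenerate.
gapped = [0.40, 0.30, 0.29, 0.28]
degenerate = [0.40, 0.399, 0.398, 0.397]
p_gapped = canonical_probabilities(gapped, temperature=0.01)
p_degenerate = canonical_probabilities(degenerate, temperature=0.01)

# Two competing partitions of three vertices with equal probability.
partitions = [{0: "x", 1: "x", 2: "y"}, {0: "x", 1: "y", 2: "y"}]
f_01 = cocluster_frequency(partitions, [0.5, 0.5], 0, 1)
```

At the same low temperature the best partition of the gapped landscape carries almost all the probability, whereas in the nearly degenerate landscape it carries less than half, so several partitions contribute to the frequency matrix.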

In a recent work by Bianconi et al. (Bianconi et al., 
2009) the notion of entropy of graph ensembles (Bianconi, 
2008; Bianconi et al., 2008) is employed to find out how 
likely it is for a cluster structure to occur on a graph with 
a given degree sequence. The entropy is computed from 
the number of graph configurations which are compatible 
with a given classification of the vertices in q groups. 
The clustering is quantitatively described by fixing the
number of edges $A(q_1, q_2)$ running between clusters $q_1$
and $q_2$, for all choices of $q_1 \neq q_2$. Bianconi et al. proposed
the following indicator of clustering significance

$$\Theta_{\vec{k},\vec{q}} = \frac{\Sigma_{\vec{k},\vec{q}} - \langle \Sigma_{\vec{k},\pi(\vec{q})} \rangle_\pi}{\sqrt{\langle (\delta\Sigma_{\vec{k},\pi(\vec{q})})^2 \rangle_\pi}}, \qquad (85)$$

where $\Sigma_{\vec{k},\vec{q}}$ is the entropy of the graph configurations with
given degree sequence $\vec{k}$ and clustering $\vec{q}$ (with fixed numbers
of inter-cluster edges $A(q_1, q_2)$), and $\langle \Sigma_{\vec{k},\pi(\vec{q})} \rangle_\pi$ is
the average entropy of the configurations with the same
degree sequence and a random permutation $\pi(\vec{q})$ of the
cluster labels. The absolute value of the entropy $\Sigma_{\vec{k},\vec{q}}$ is
not meaningful, so the comparison of $\Sigma_{\vec{k},\vec{q}}$ and $\langle \Sigma_{\vec{k},\pi(\vec{q})} \rangle_\pi$
is crucial, as it tells how relevant the actual cluster structure
is with respect to a random classification of the vertices.
However, different permutations of the assignments
$\vec{q}$ yield different values of the entropy, which can fluctuate
considerably. Therefore one has to compute the standard
deviation $\sqrt{\langle (\delta\Sigma_{\vec{k},\pi(\vec{q})})^2 \rangle_\pi}$ of the entropy corresponding to all
random permutations $\pi(\vec{q})$, to estimate how significant
the difference between $\Sigma_{\vec{k},\vec{q}}$ and $\langle \Sigma_{\vec{k},\pi(\vec{q})} \rangle_\pi$ is. In this
way, if $\Theta_{\vec{k},\vec{q}} \leq 1$, the entropy of the given cluster structure
is of the same order as the entropy of some random
permutation of the cluster labels, so it is not relevant.
Instead, if $\Theta_{\vec{k},\vec{q}} \gg 1$, the cluster structure is far more
likely than a random classification of the vertices, so the
clustering is relevant. The indicator $\Theta_{\vec{k},\vec{q}}$ can be simply
generalized to the case of directed and weighted graphs.
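
The structure of Eq. (85) is that of a z-score under random permutations of the cluster labels. The sketch below illustrates only that structure: the entropy of Bianconi et al. is not implemented, and `score` is a hypothetical stand-in supplied by the caller. The function name `significance_z`, the number of permutations and the seed are likewise assumptions made for this example.

```python
import math
import random
import statistics


def significance_z(score, labels, n_perm=200, seed=0):
    """Return (score(labels) - mean) / std over random label permutations."""
    rng = random.Random(seed)
    observed = score(labels)
    shuffled = list(labels)
    samples = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # one random permutation of the cluster labels
        samples.append(score(shuffled))
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    if std == 0:
        # Every permutation gives the same score: there is no scale to compare with.
        if observed == mean:
            return 0.0
        return math.copysign(math.inf, observed - mean)
    return (observed - mean) / std


# Toy usage: two triangles joined by one edge. The stand-in score counts
# intra-cluster edges; it is NOT the entropy used by Bianconi et al.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]


def intra_edges(labels):
    return sum(1 for u, v in EDGES if labels[u] == labels[v])


z = significance_z(intra_edges, [0, 0, 0, 1, 1, 1])
```

A value of `z` well above 1 means the given labelling scores far from what shuffled labels typically achieve, which mirrors the reading of the indicator given above.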

Lancichinetti et al. (Lancichinetti et al., 2009) also
addressed the issue by comparing the cluster structure of
the graph with that of a random graph with similar properties.
An important novelty of this approach is the fact
that it estimates the significance of single communities, 
not of partitions. In fact, not all communities are equally 
significant, in general, so it makes a lot of sense to check 
them individually. In particular, it may happen that real 
networks are not fully modular, due to their particular 
history or generating mechanisms, and that only portions 
of them display community structure. The main idea is 
to verify how likely it is that a community C is a sub- 
graph of a random graph with the same degree sequence 
of the original graph. This likelihood is called C-score, 
and is computed by examining the vertex $w$ of $C$ with
the lowest internal degree $k_w^{in}$ in $C$ (the "worst" vertex).
The C-score is defined as the probability that the internal
degree of the worst vertex in the corresponding community
of the null model graphs is larger than or equal to
$k_w^{in}$. This probability is computed by using tools from
Extreme and Order Statistics (Beirlant et al., 2004; David
and Nagaraja, 2003). A low value of the C-score (< 5%)
is a strong indication that the group of vertices at study 
is a community and not the product of random fluctu- 
ations. In addition, the measure can be used to check 
whether a subgraph has an internal modular structure. 
For that, one removes the vertices of the subgraph one 
at a time, starting from the worst and proceeding in increasing
order of the internal degree, and observes how
the C-score varies at each vertex removal: sharp drops indicate
the presence of dense subgraphs (Fig. 29). Therefore,
one could think of using the C-score as an ingredient of
new clustering techniques. As we have seen, the C-score 
is based on the behavior of the vertex with lowest internal 
degree of the subgraph. Real networks are characterized 
by noise, which could strongly affect the structural rela- 
tionships between vertices and clusters. For this reason, 
relying on the properties of a single vertex to evaluate 
the significance of a subgraph could be a problem for 
applications to real networks. Lancichinetti et al. have 
shown that the C-score can be easily extended to consider
the $t$ vertices with lowest internal degree, with $t > 1$
(B-score). The main limit of the C-score is the fact that its
null model is the same as that of Newman-Girvan mod- 
ularity. According to this null model, each vertex can in 
principle be connected to any other, no matter how large 
the system is. This is however not realistic, especially for 
large graphs, where it is much more reasonable to assume 
that each vertex has its own "horizon", i.e. a subset of 
other vertices with which it can interact, which is usually 
much smaller than the whole system (see Section VI. C). 
How to define such "horizons" and, more in general, re- 
alistic null models is still an open problem. However, the 
C-score could be easily reformulated with any null model, 
so one could readily derive more reliable definitions. 

We conclude with a general issue which is related to 
the significance of community structure. The question is: 
given a cluster structure in a graph, can it be recovered 
a priori by an algorithm? In a recent paper (Reichardt
and Leone, 2008), Reichardt and Leone studied under
which conditions a special built-in cluster structure can 
be recovered. The clusters have equal size and a pair of 
vertices is connected with probability p if they belong to 
the same cluster, with probability r < p otherwise. In 
computer science this is known as the planted partition- 
ing problem (Condon and Karp, 2001). The goal is to 
propose algorithms that recover the planted partition for 
any choice of p and r. For dense graphs, i. e. graphs 
whose average degree grows with the number n of ver- 
tices, algorithms can be designed that find the solution 
with a probability which equals 1 minus a term that van- 
ishes in the limit of infinite graph size, regardless of the 
difference p — r, which can then be chosen arbitrarily 
small. Since many real networks are not dense graphs,
as their average degree $\langle k \rangle$ is usually much smaller than
$n$ and does not depend on it, Reichardt and Leone investigated
the problem in the case of fixed $\langle k \rangle$ and infinite
graph size. We indicate with $q$ the number of clusters and
with $p_{in}$ the probability that a randomly selected edge of
the graph lies within any of the $q$ clusters. In this way,
if $p_{in} = 1/q$, the inter-cluster edge density matches the
intra-cluster edge density (i. e. $p = r$), and the planted
partition would not correspond to a recoverable clustering,
whereas for $p_{in} = 1$, there are no inter-cluster edges
and the partition can be trivially recovered. The value
of $p_{in}$ is in principle unknown, so one has to detect the
cluster structure ignoring this information.

FIG. 29 Application of the C-score by Lancichinetti et al. (Lancichinetti et al., 2009) to identify modules within subgraphs. In
(a) the subgraph consists of a compact cluster (generated with the LFR benchmark (Lancichinetti and Fortunato, 2009; Lancichinetti
et al., 2008)) plus some randomly added vertices. In (b) the subgraph consists of two compact clusters interconnected
by a few edges. Vertices are removed from each subgraph in increasing order of their internal degree. The C-score displays sharp
drops after all the spurious vertices (a) and all the vertices of one of the two clusters (b) are removed. We notice that the first
subgraph (a) is not significant (high C-score) until the noise represented by the randomly added vertices disappears, whereas
the second subgraph (b) is a community at the very beginning, as it should be; it loses significance when one of the clusters
is heavily damaged (because the remainder of the cluster appears as noise, just like the spurious vertices in (a)), and becomes
significant again when the damaged cluster is totally removed. Reprinted figure with permission from Ref. (Lancichinetti et al.,
2009).

Reichardt and Leone proposed to look for a minimum cut partition,
i. e. for the partition that minimizes the number of
inter-cluster edges, as it is usually done in the graph partitioning
problem (discussed in Section IV. A). Clearly,
for $p_{in} = 1$ the minimum cut partition trivially coincides
with the planted partition, whereas for $1/q < p_{in} < 1$
there should be some overlap, which is expected to vanish
in the limit case $p_{in} = 1/q$. The minimum cut partition
corresponds to the minimum of the following ferromagnetic
Potts model Hamiltonian

$$\mathcal{H}_{part} = -\sum_{i<j} J_{ij} \delta_{\sigma_i, \sigma_j}, \qquad (86)$$
over the set of all spin configurations with zero magnetization.
Here the spin $\sigma_i$ indicates the cluster vertex $i$
belongs to, and the coupling matrix $J_{ij}$ is just the adjacency
matrix of the graph. The constraint of zero magnetization
ensures that the clusters have all the same size,
as required by the planted partitioning problem. The
energy of a spin configuration, expressed by Eq. 86, is
the negative of the number of edges that lie within clusters:
the minimum energy corresponds to the maximum
number of intra-cluster edges, which is coupled to the
minimum number of inter-cluster edges. The minimum
energy can be computed with the cavity method, or belief
propagation, at zero temperature (Mézard and Parisi,
2003). The accuracy of the solution with respect to the
planted partition is expressed by the fraction of vertices
which are put in the same class in both partitions. The
analysis yields a striking result: the planted clustering is
accurately recovered for $p_{in}$ larger than a critical threshold
$p_{in}^c > 1/q$. So, there is a range of values of $p_{in}$,
$1/q < p_{in} < p_{in}^c$, in which the clustering is not recoverable,
as the minimum cut partition is uncorrelated with
it. The threshold $p_{in}^c$ depends on the degree distribution
$p(k)$ of the graph.
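
To make the link between the Potts energy of Eq. (86) and the cut size concrete, here is a minimal sketch that evaluates both for a given spin configuration. It only evaluates the energy; it does not perform the cavity-method minimization discussed above. The function names and the list-based encoding of the spins are assumptions of this example.

```python
def potts_energy(edges, sigma):
    """Energy of Eq. (86) with J_ij equal to the adjacency matrix.

    edges lists each undirected edge once; sigma[i] is the spin of vertex i.
    The result is minus the number of intra-cluster edges.
    """
    return -sum(1 for i, j in edges if sigma[i] == sigma[j])


def cut_size(edges, sigma):
    """Number of inter-cluster edges of the partition encoded by sigma."""
    return sum(1 for i, j in edges if sigma[i] != sigma[j])


def has_zero_magnetization(sigma, q):
    """True if each of the q spin values labels the same number of vertices."""
    counts = [sigma.count(s) for s in range(q)]
    return sum(counts) == len(sigma) and len(set(counts)) == 1


# Two triangles joined by a single edge, split into their natural halves.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
sigma = [0, 0, 0, 1, 1, 1]
energy = potts_energy(edges, sigma)
cut = cut_size(edges, sigma)
```

Since every edge is either intra-cluster or inter-cluster, `cut - energy` always equals the number of edges, so at fixed graph the lowest energy and the smallest cut are reached by the same configuration.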

XV. TESTING ALGORITHMS 

When a clustering algorithm is designed, it is neces- 
sary to test its performance, and compare it with that 
of other methods. In the previous sections we have said 
very little about the performance of the algorithms, other 
than their computational complexity. Indeed, the issue 
of testing algorithms has received very little attention 
in the literature on graph clustering. This is a serious 
limit of the field. Because of that, it is still impossible to
state which method (or subset of methods) is the most
reliable in applications, and people rely blindly on some
algorithms instead of others for reasons that have nothing
to do with the actual performance of the algorithms,
like, e. g., popularity (of the method or of its inventor).
This lack of control is also the main reason for the pro- 
liferation of graph clustering techniques in the last few 
years. Virtually in any paper, where a new method is 
introduced, the part about testing consists in applying 
the method to a small set of simple benchmark graphs, 
whose cluster structure is fairly easy to recover. Because 
of that, the freedom in the design of a clustering algo- 
rithm is basically infinite, whereas it is not clear what a 
new procedure is adding to the field, if anything. 

In this section we discuss at length the issue of testing. 
First, we describe the fundamental ingredients of any 
testing procedure, i. e. benchmark graphs with built-in 
community structure, that methods have to identify (Sec- 
tion XV. A). We proceed by reviewing measures to com- 
pare graph partitions with each other (Section XV. B). In 
Section XV. C we present the comparative evaluations of 
different methods that have been performed in the liter- 
ature. 



A. Benchmarks 

Testing an algorithm essentially means applying it to 
a specific problem whose solution is known and compar- 
ing such solution with that delivered by the algorithm. 
In the case of graph clustering, a problem with a well- 
defined solution is a graph with a clear community struc- 
ture. This concept is not trivial, however. Many cluster- 
ing algorithms are based on similar intuitive notions of 
what a community is, but different implementations. So 
it is crucial that the scientific community agrees on a 
set of reliable benchmark graphs. This mostly applies 
to computer-generated graphs, where the built-in clus- 
ter structure can be arbitrarily designed. In the liter- 
ature real networks are used as well, in those cases in 
which communities are well defined because of informa- 
tion about the system. 

We start our survey from computer-generated benchmarks.
A special class of graphs has become quite popular
in the last years. They are generated with the
so-called planted $\ell$-partition model (Condon and Karp,
2001). The model partitions a graph with $n = g \cdot \ell$ vertices
in $\ell$ groups with $g$ vertices each. Vertices of the
same group are linked with a probability $p_{in}$, whereas
vertices of different groups are linked with a probability
$p_{out}$. Each subgraph corresponding to a group is then
a random graph à la Erdös-Rényi with connection probability
$p = p_{in}$ (Section A. 3). The average degree of a
vertex is $\langle k \rangle = p_{in}(g - 1) + p_{out}\, g(\ell - 1)$. If $p_{in} > p_{out}$ the
intra-cluster edge density exceeds the inter-cluster edge
density and the graph has a community structure. This
idea is quite intuitive and we have encountered it on several
occasions in the previous sections. Girvan and Newman
considered a special case of the planted $\ell$-partition
model (Girvan and Newman, 2002). They set $\ell = 4$,
$g = 32$ (so the number of graph vertices is $n = 128$) and
fixed the average total degree $\langle k \rangle$ to 16. This implies
that $p_{in} + 3p_{out} \approx 1/2$, so the probabilities $p_{in}$ and $p_{out}$
are not independent parameters. In calculations it is common
to use as parameters $z_{in} = p_{in}(g - 1) = 31p_{in}$ and
$z_{out} = p_{out}\, g(\ell - 1) = 96p_{out}$, indicating the expected
internal and external degree of a vertex, respectively.
These particular graphs have by now gained the status
of standard benchmarks (Girvan and Newman, 2002)
(Fig. 30). In the first applications of the graphs one assumed
that communities are well defined when $z_{out} < 8$,
corresponding to the situation in which the internal degree
exceeds the external degree. However, the threshold
$z_{out} = z_{in} = 8$ implies $p_{in} \approx 1/4$ and $p_{out} = 1/12$,
so it is not the actual threshold of the model, where
$p_{in} = p_{out} = 1/8$, corresponding to $z_{out} \approx 12$. So, one
expects (see footnote) to be able to detect the planted partition up
until $z_{out} \approx 12$.
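
The construction just described is short enough to sketch directly. The code below is an illustrative generator, not the software used in the cited works: the function names, the edge-list output and the seeding are assumptions of this example, while the conversions $p_{in} = z_{in}/31$ and $p_{out} = z_{out}/96$ follow the relations given above.

```python
import itertools
import random


def planted_partition(n_groups, group_size, p_in, p_out, seed=None):
    """Sample a planted partition graph; returns (edges, membership)."""
    rng = random.Random(seed)
    n = n_groups * group_size
    membership = [v // group_size for v in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        # Same group: link with p_in; different groups: link with p_out.
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, membership


def girvan_newman_benchmark(z_out, seed=None):
    """Four groups of 32 vertices with expected degree z_in + z_out = 16."""
    if not 0 <= z_out <= 16:
        raise ValueError("z_out must lie between 0 and 16")
    z_in = 16 - z_out
    return planted_partition(4, 32, z_in / 31, z_out / 96, seed)


edges, membership = girvan_newman_benchmark(z_out=6, seed=1)
mean_degree = 2 * len(edges) / len(membership)
```

The loop visits all vertex pairs, so the cost is quadratic in the number of vertices; this is harmless for 128 vertices but would need a smarter sampler for large graphs.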

Testing a method against the Girvan-Newman bench- 
mark consists in calculating the similarity between the 
partitions determined by the method and the natural 
partition of the graph in the four equal-sized groups. Sev- 
eral measures of partitions' similarity may be adopted; we 
describe them in Section XV.B. One usually builds many
graph realizations for a particular value of $z_{out}$ and computes
the average similarity between the solutions of the
method and the built-in solution. The procedure is then
iterated for different values of $z_{out}$. The results are usually
represented in a plot, where the average similarity
is drawn as a function of $z_{out}$. Most algorithms usually
do a good job for small $z_{out}$ and start to fail when
$z_{out}$ approaches 8. Fan et al. (Fan et al., 2007) have
designed a weighted version of the benchmark of Girvan
and Newman, in that one gives different weights to
edges inside and between communities. One could pick
just two values, one for intra- and the other for inter-community
edges, or uniformly distributed values in two
different ranges. For this benchmark there are then two
parameters that can be varied: $z_{out}$ and the relative importance
of the internal and the external weights. Typically
one fixes the topological structure and varies the
weights. This is particularly insightful when $z_{out} = 4$,
which delivers graphs without topological cluster structure:
in this case, the question whether there are clusters
or not depends entirely on the weights.

Footnote: However, we stress that, even if communities are there, methods
may be unable to detect them. The reason is that, due to
fluctuations in the distribution of links in the graphs, already
before the limit imposed by the planted partition model it may
be impossible to detect the communities and the model graphs
may look similar to random graphs.

FIG. 30 Benchmark of Girvan and Newman. The three pictures correspond to $z_{in} = 15$ (a), $z_{in} = 11$ (b) and $z_{in} = 8$ (c). In
(c) the four groups are hardly visible. Reprinted figure with permission from Ref. (Guimerà and Amaral, 2005). ©2005 by the
Nature Publishing Group.

As we have remarked above, the planted $\ell$-partition
model generates mutually interconnected random graphs
à la Erdös-Rényi. Therefore, all vertices have approximately
the same degree. Moreover, all communities have
exactly the same size by construction. These two features 
are at odds with what is observed in graph representa- 
tions of real systems. Degree distributions are usually 
skewed, with many vertices with low degree coexisting 
with a few vertices with high degree. A similar hetero- 
geneity is also observed in the distribution of cluster sizes, 
as we shall see in Section XVI. So, the planted $\ell$-partition
model is not a good description of a real graph with community
structure. However, the model can be modified to
account for the heterogeneity of degrees and community
sizes. A modified version of the model, called Gaussian
random partition generator, was designed by Brandes et
al. (Brandes et al., 2003). Here the cluster sizes have a
Gaussian distribution, so they are not the same, although 
they do not differ much from each other. The hetero- 
geneity of the cluster sizes introduces a heterogeneity in 
the degree distribution as well, as the expected degree of 
a vertex depends on the number of vertices of its clus- 
ter. Still, the variability of degree and cluster size is not 
appreciable. Besides, vertices of the same cluster keep 
having approximately the same degree. A better job in 
this direction has been recently done by Lancichinetti et
al. (LFR benchmark) (Lancichinetti et al., 2008). They
assume that the distributions of degree and community
size are power laws, with exponents $\tau_1$ and $\tau_2$, respectively.
Each vertex shares a fraction $1 - \mu$ of its edges
with the other vertices of its community and a fraction $\mu$
with the vertices of the other communities; $0 \leq \mu \leq 1$ is
the mixing parameter. The graphs are built as follows:

1. A sequence of community sizes obeying the prescribed
power law distribution is extracted. This is
done by picking random numbers from a power law
distribution with exponent $\tau_2$.

2. Each vertex $i$ of a community receives an internal
degree $(1 - \mu)k_i$, where $k_i$ is the degree of vertex $i$,
which is taken from a power law distribution with exponent
$\tau_1$. In this way, each vertex $i$ has a number
of stubs $(1 - \mu)k_i$.

3. All stubs of vertices of the same community are randomly
attached to each other, until no more stubs
are "free". In this way the sequence of internal degrees
of each vertex in its community is maintained.

4. Each vertex $i$ receives now an additional number of
stubs, equal to $\mu k_i$ (so that the final degree of the
vertex is $k_i$), that are randomly attached to vertices
of different communities, until no more stubs
are "free".
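
Steps 1 and 2 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released LFR software: power laws are sampled on bounded integer ranges, the last community is trimmed so that the sizes sum to $n$, internal degrees are rounded to integers, and the consistency checks between degrees and community sizes as well as the stub matching of steps 3 and 4 are left out. All function and parameter names are hypothetical.

```python
import random


def sample_power_law(exponent, low, high, rng):
    """Draw an integer in [low, high] with probability proportional to x**(-exponent)."""
    values = list(range(low, high + 1))
    weights = [x ** (-exponent) for x in values]
    return rng.choices(values, weights=weights, k=1)[0]


def lfr_sizes_and_stubs(n, tau1, tau2, mu, k_min, k_max, s_min, s_max, seed=None):
    """Return (sizes, vertices) for steps 1 and 2 of the construction.

    vertices is a list of tuples (community, degree, internal_stubs, external_stubs).
    """
    rng = random.Random(seed)

    # Step 1: community sizes from a power law with exponent tau2.
    sizes = []
    while sum(sizes) < n:
        sizes.append(sample_power_law(tau2, s_min, s_max, rng))
    sizes[-1] -= sum(sizes) - n  # assumption: trim the last community to fit n

    # Step 2: degrees from a power law with exponent tau1, split by mu.
    vertices = []
    for community, size in enumerate(sizes):
        for _ in range(size):
            k = sample_power_law(tau1, k_min, k_max, rng)
            k_internal = round((1 - mu) * k)
            vertices.append((community, k, k_internal, k - k_internal))
    return sizes, vertices


sizes, vertices = lfr_sizes_and_stubs(
    n=200, tau1=2.0, tau2=1.0, mu=0.3,
    k_min=5, k_max=30, s_min=20, s_max=50, seed=3,
)
```

The external stubs are computed as the remainder of the degree, so that internal and external stubs always add up to $k_i$ even after rounding.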

Numerical tests show that this procedure has a complexity
$O(m)$, where $m$ is as usual the number of edges
of the graph, so it can be used to create graphs of sizes
spanning several orders of magnitude. Fig. 31 shows an
example of a LFR benchmark graph. Recently the LFR
benchmark has been extended to directed and weighted
graphs with overlapping communities (Lancichinetti
and Fortunato, 2009). The software to create the
LFR benchmark graphs can be freely downloaded at
http://santo.fortunato.googlepages.com/inthepress2.

FIG. 31 A realization of the LFR benchmark graphs (Lancichinetti et al., 2008), with 500 vertices. The distributions of the
vertex degree and of the community size are both power laws. Such benchmark is a more faithful approximation of real-world
networks with community structure than simpler benchmarks like, e. g., that by Girvan and Newman (Girvan and Newman,
2002). Reprinted figure with permission from Ref. (Lancichinetti et al., 2008). ©2008 by the American Physical Society.

A class of benchmark graphs with power law degree
distributions had been previously introduced by
Bagrow (Bagrow, 2008). The construction process starts
from a graph with a power-law degree distribution.
Bagrow used Barabási-Albert scale-free graphs (Barabási
and Albert, 1999). Then, vertices are randomly assigned
to one of four equally-sized communities. Finally, pairs
of edges between two communities are rewired so that
either edge ends up within the same community, without
altering the degree sequence (on average). This is
straightforward: suppose that the edges join the vertex
pairs $a_1$, $b_1$ and $a_2$, $b_2$, where $a_1$, $a_2$ belong to community
$A$ and $b_1$, $b_2$ to community $B$. If the edges are
replaced by $a_1$-$a_2$ and $b_1$-$b_2$ (provided they do not exist
already), all vertices keep their degrees. With this
rewiring procedure one can arbitrarily vary the edge density
within and, consequently, between clusters. In this
class of benchmarks, however, communities are all of the
same size by construction, although one can in principle
relax this condition.
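
The rewiring move described above can be written in a few lines. The sketch below is an assumption-laden illustration rather than Bagrow's own code: edges are stored as a set of `frozenset` pairs, `community` maps each vertex to its group, and the name `rewire_pair` is hypothetical. Both input edges are assumed to be present in the edge set.

```python
def rewire_pair(edges, community, edge1, edge2):
    """Replace two inter-community edges a1-b1, a2-b2 by a1-a2 and b1-b2.

    Returns True if the move was applied, False if it was rejected.
    Every vertex keeps its degree when the move is applied.
    """
    a1, b1 = edge1
    a2, b2 = edge2
    # Orient the second edge so that a1, a2 share a community.
    if community[a1] != community[a2]:
        a2, b2 = b2, a2
    same_sides = (community[a1] == community[a2]
                  and community[b1] == community[b2])
    crossing = community[a1] != community[b1]
    new1, new2 = frozenset((a1, a2)), frozenset((b1, b2))
    if (not same_sides or not crossing or a1 == a2 or b1 == b2
            or new1 in edges or new2 in edges):
        return False
    edges.difference_update({frozenset(edge1), frozenset(edge2)})
    edges.update({new1, new2})
    return True


community = {0: "A", 1: "A", 2: "B", 3: "B"}
edges = {frozenset((0, 2)), frozenset((1, 3))}
applied = rewire_pair(edges, community, (0, 2), (3, 1))
```

Rejecting moves that would create a self-loop or duplicate an existing edge is what keeps the degrees of all four vertices unchanged.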

A (seemingly) different benchmark is represented by 
the class of relaxed caveman graphs, which were origi- 
nally introduced to explain the clustering properties of 
social networks (Watts, 2003). The starting point is a 
set of disconnected cliques. With some probability edges 
are rewired to link different cliques. Such model graphs 
are interesting as they are smooth variations of the ideal 
graph with "perfect" communities, i. e. disconnected 
cliques. On the other hand the model is equivalent to
the planted $\ell$-partition model, where $p_{in} = 1 - p$ and
$p_{out}$ is proportional to $p$, with coefficient depending on
the size of the clusters.

Benchmark graphs have also been introduced to deal 
with special types of graphs and/or cluster structures. 
For instance, Arenas et al. (Arenas et al., 2006) have
introduced a class of benchmark graphs with embedded
hierarchical structure, which extends the class of graphs
by Girvan and Newman. Here there are 256 vertices and
two hierarchical levels, corresponding to a partition in 16
groups (microcommunities) with 16 vertices and a partition
in 4 larger groups of 64 vertices (macrocommunities),
comprising each 4 of the smaller groups. The
edge densities within and between the clusters are indicated
by three parameters $z_{in_1}$, $z_{in_2}$ and $z_{out}$: $z_{in_1}$ is
the expected internal degree of a vertex within its microcommunity;
$z_{in_2}$ is the expected number of edges that the
vertex shares with the vertices of the other microcommunities
within its macrocommunity; $z_{out}$ is the expected
number of edges connecting the vertex with vertices of
the other three macrocommunities. The average degree
$\langle k \rangle = z_{in_1} + z_{in_2} + z_{out}$ of a vertex is fixed to 18. Fig. 7
shows an example of hierarchical graph constructed based
on the same principle, with 512 vertices and an average
degree of 32.

Guimerà et al. (Guimerà et al., 2007) have proposed
a model of bipartite graphs with built-in communities.
They considered a bipartite graph of actors and teams;
here we describe how to build the benchmarks for general
bipartite graphs. One starts from a bipartite graph
whose vertex classes $A$ and $B$ are partitioned into $n_c$
groups, $C_i^A$ and $C_i^B$ ($i = 1, 2, \ldots, n_c$). Each cluster $C_i$
comprises all vertices of the subgroups $C_i^A$ and $C_i^B$, respectively.
With probability $p$ edges are placed between
vertices of subgroups $C_i^A$ and $C_i^B$ ($i = 1, 2, \ldots, n_c$), i. e.
within clusters. With probability $1 - p$, edges are placed
between vertices of subgroups $C_i^A$ and $C_j^B$, where $i$ and $j$
are chosen at random, so they can be equal or different.
By construction, a non-zero value of the probability $p$
indicates a preference by vertices to share links with vertices
of the same cluster, i. e. for $p > 0$ the graph has a
built-in community structure. For $p = 1$ there would be
edges only within clusters, i. e. the graph has a perfect
cluster structure.

Finally, Sawardecker et al. introduced a general 
model, that accounts for the possibility that clusters over- 
lap (Sawardecker et al., 2009). The model is based on the 
reasonable assumption that the probability $p_{ij}$ that two
vertices are connected by an edge grows with the number
$n_0$ of communities both vertices belong to. For vertices
in different clusters, $p_{ij} = p_0$; if they are in the same
cluster (and only in that one) $p_{ij} = p_1$; if they belong
to the same two clusters $p_{ij} = p_2$, etc. By hypothesis,
$p_0 < p_1 \leq p_2 \leq p_3 \ldots$ The planted $\ell$-partition model is
recovered when $p_1 = p_2 = p_3 \ldots$
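
As an illustration of this rule, the sketch below draws a graph in which the linking probability of a vertex pair depends only on the number of communities the two vertices share. It is a minimal example, not the authors' implementation: the function names are hypothetical, and reusing the last entry of `p` when two vertices share more communities than `p` provides for is an assumption of this example.

```python
import itertools
import random


def link_probability(groups_u, groups_v, p):
    """p[s] is the linking probability for vertices sharing s communities."""
    shared = len(set(groups_u) & set(groups_v))
    # Assumption: beyond the end of p, keep using its last value.
    return p[min(shared, len(p) - 1)]


def overlapping_benchmark(groups, p, seed=None):
    """groups[v] is the set of communities of vertex v; returns an edge list."""
    if any(a > b for a, b in zip(p, p[1:])):
        raise ValueError("p must be non-decreasing: p0 < p1 <= p2 <= ...")
    rng = random.Random(seed)
    return [
        (u, v)
        for u, v in itertools.combinations(range(len(groups)), 2)
        if rng.random() < link_probability(groups[u], groups[v], p)
    ]


# Vertex 3 belongs to both communities and bridges them.
groups = [{0}, {0}, {1}, {0, 1}]
edges = overlapping_benchmark(groups, p=[0.0, 1.0], seed=0)
```

With the extreme values used here the outcome is deterministic: pairs sharing at least one community are always linked and all other pairs never are.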

As we have seen, nearly all existing benchmark graphs 
are inspired by the planted $\ell$-partition model, to some
extent. However, the model needs to be refined to pro- 
vide a good description of real graphs with community 
structure. The hypothesis that the linking probabilities 
of each vertex with the vertices of its community or of 
the other communities are constant is not realistic. It is 
more plausible that each pair of vertices $i$ and $j$ has its
own linking probability $p_{ij}$, and that such probabilities
are correlated for vertices in the same cluster.

Tests on real networks usually focus on a very limited 
number of examples, for which one has precise informa- 
tion about the vertices and their properties. 

In Section II we have introduced two popular real net- 
works with known community structure, i. e. the social 
network of Zachary's karate club and the social network 
of bottlenose dolphins living in Doubtful Sound (New 
Zealand), studied by Lusseau. Here, the question is 
whether the actual separation in two social groups could 
be predicted from the graph topology. Zachary's karate 
club is by far the most investigated system. Several algo- 
rithms are actually able to identify the two classes, mod- 
ulo a few intermediate vertices, which may be misclassi- 
fied. Other methods are less successful: for instance, the 
maximum of Newman-Girvan modularity corresponds to 
a split of the network in four groups (Donetti and Muñoz,
2004; Duch and Arenas, 2005). Another well known ex- 
ample is the network of American college football teams, 
derived by Girvan and Newman (Girvan and Newman, 
2002). There are 115 vertices, representing the teams, 
and two vertices are connected if their teams play against 
each other. The teams are divided into 12 conferences. 
Games between teams in the same conference are more 
frequent than games between teams of different conferences,
so one has a natural partition where the communities
correspond to the conferences.

When dealing with real networks, it is useful to re- 
solve their community structure with different clustering 
techniques, to cross-check the results and make sure that 
they are consistent with each other, as in some cases the 
answer may strongly depend on the specific algorithm 
adopted. However, one has to keep in mind that there 
is no guarantee that "reasonable" communities, defined 
on the basis of non-structural information, must coincide 
with those detected by methods based only on the graph 
structure. 



B. Comparing partitions: measures 

Checking the performance of an algorithm involves 
defining a criterion to establish how "similar" the par- 
tition delivered by the algorithm is to the partition one 
wishes to recover. Several measures for the similarity of 
partitions exist. In this section we present and discuss 
the most popular measures. A thorough introduction of 
similarity measures for graph partitions has been given 
by Meila (Meila, 2007) and we will follow it closely. 

Let us consider two generic partitions $\mathcal{X} =
(X_1, X_2, \ldots, X_{n_X})$ and $\mathcal{Y} = (Y_1, Y_2, \ldots, Y_{n_Y})$ of a graph
$\mathcal{G}$, with $n_X$ and $n_Y$ clusters, respectively. We indicate
with $n$ the number of graph vertices, with $n_i^X$ and $n_j^Y$
the number of vertices in clusters $X_i$ and $Y_j$ and with
$n_{ij}$ the number of vertices shared by clusters $X_i$ and $Y_j$:
$n_{ij} = |X_i \cap Y_j|$.

In the first tests using the benchmark graphs by Girvan 
and Newman (Section XV. A) scholars used a measure 
proposed by Girvan and Newman themselves, the fraction
of correctly classified vertices. A vertex is correctly
classified if it is in the same cluster with at least half of 
its "natural" partners. If the partition found by the al- 
gorithm has clusters given by the merging of two or more 
natural groups, all vertices of the cluster are considered 
incorrectly classified. The number of correctly classified 
vertices is then divided by the total size of the graph, 
to yield a number between 0 and 1. The recipe to label
vertices as correctly or incorrectly classified is somewhat 
arbitrary, though. 

Apart from the fraction of correctly classified vertices, 
which is somewhat ad hoc and distinguishes the roles of 
the natural partition and of the algorithm's partition, 
most similarity measures can be divided in three cate- 
gories: measures based on pair counting, cluster match- 
ing and information theory. 

Measures based on pair counting depend on the num- 
ber of pairs of vertices which are classified in the same 
(different) clusters in the two partitions. In particular
$a_{11}$ indicates the number of pairs of vertices which are
in the same community in both partitions, $a_{01}$ ($a_{10}$) the
number of pairs of elements which are put in the same
community in $\mathcal{X}$ ($\mathcal{Y}$) and in different communities in $\mathcal{Y}$
($\mathcal{X}$) and $a_{00}$ the number of pairs of vertices that are in
different communities in both partitions. Wallace (Wallace,
1983) proposed the two indices

$$W_I = \frac{a_{11}}{\sum_k n_k^X (n_k^X - 1)/2}, \qquad W_{II} = \frac{a_{11}}{\sum_k n_k^Y (n_k^Y - 1)/2}. \qquad (87)$$

$W_I$ and $W_{II}$ represent the probability that vertex pairs
in the same cluster of $\mathcal{X}$ are also in the same cluster for
$\mathcal{Y}$, and vice versa. These indices are asymmetrical, as the
role of the two partitions is not the same. Fowlkes and
Mallows (Fowlkes and Mallows, 1983) suggested to use
the geometric mean of $W_I$ and $W_{II}$, which is symmetric.

The Rand index (Rand, 1971) is the ratio of the number
of vertex pairs correctly classified in both partitions (i. e.
either in the same or in different clusters) to the total
number of pairs

$$R(\mathcal{X}, \mathcal{Y}) = \frac{a_{11} + a_{00}}{a_{11} + a_{01} + a_{10} + a_{00}}. \qquad (88)$$

A measure equivalent to the Rand index is the Mirkin
metric (Mirkin, 1996)

$$M(\mathcal{X}, \mathcal{Y}) = 2(a_{01} + a_{10}) = n(n - 1)[1 - R(\mathcal{X}, \mathcal{Y})]. \qquad (89)$$

The Jaccard index is the ratio of the number of vertex
pairs classified in the same cluster in both partitions to
the number of vertex pairs which are classified in the
same cluster in at least one partition, i. e.

$$J(\mathcal{X}, \mathcal{Y}) = \frac{a_{11}}{a_{11} + a_{01} + a_{10}}. \qquad (90)$$

Adjusted versions of both the Rand and the Jaccard in- 
dex exist, in that a null model is introduced, correspond- 
ing to the hypothesis of independence of the two parti- 
tions (Meila, 2007). The null model expectation value 
of the measure is subtracted from the unadjusted version,
and the result is normalized by the range of this
difference, yielding 1 for identical partitions and 0 as expected
value for independent partitions (negative values
are possible as well). Unadjusted indices have the draw- 
back that they are not local, i. e. the result depends 
on how the whole graph is partitioned, even when the 
partitions differ only in a small region of the graph. 
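
The pair-counting measures of Eqs. (87)-(90) all derive from the same four counts, so a single pass over the vertex pairs suffices to compute them. The sketch below follows the convention of the text ($a_{01}$ counts pairs that are together in $\mathcal{X}$ and apart in $\mathcal{Y}$); the function names and the dictionary output are assumptions of this example, and the adjusted versions are not included.

```python
from itertools import combinations
from math import sqrt


def pair_counts(x, y):
    """x[i], y[i] are the cluster labels of vertex i; returns (a11, a01, a10, a00)."""
    a11 = a01 = a10 = a00 = 0
    for i, j in combinations(range(len(x)), 2):
        same_x = x[i] == x[j]
        same_y = y[i] == y[j]
        if same_x and same_y:
            a11 += 1
        elif same_x:
            a01 += 1  # together in X, apart in Y
        elif same_y:
            a10 += 1  # together in Y, apart in X
        else:
            a00 += 1
    return a11, a01, a10, a00


def _ratio(num, den):
    """Plain division, with NaN when the denominator vanishes."""
    return num / den if den else float("nan")


def pair_counting_indices(x, y):
    a11, a01, a10, a00 = pair_counts(x, y)
    w1 = _ratio(a11, a11 + a01)  # Wallace W_I
    w2 = _ratio(a11, a11 + a10)  # Wallace W_II
    return {
        "wallace_I": w1,
        "wallace_II": w2,
        "fowlkes_mallows": sqrt(w1 * w2),
        "rand": _ratio(a11 + a00, a11 + a01 + a10 + a00),
        "mirkin": 2 * (a01 + a10),
        "jaccard": _ratio(a11, a11 + a01 + a10),
    }


x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 2, 2]
counts = pair_counts(x, y)
indices = pair_counting_indices(x, y)
```

For this pair of partitions the counts are (2, 4, 1, 8), so the Rand index is 10/15 and the Mirkin metric is 10, in agreement with $n(n-1)[1-R]$ for $n = 6$.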

Similarity measures based on cluster matching aim at 
finding the largest overlaps between pairs of clusters of 
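different partitions. For instance, the classification error
$H(\mathcal{X}, \mathcal{Y})$ is defined as (Meilă and Heckerman, 2001)

$$H(\mathcal{X}, \mathcal{Y}) = 1 - \frac{1}{n} \max_\pi \sum_{k=1}^{n_X} n_{k\pi(k)}, \qquad (91)$$

where $\pi$ is an injective mapping from the cluster indices
of partition $\mathcal{X}$ to the cluster indices of partition $\mathcal{Y}$, so that
$n_{k\pi(k)}$ is the overlap between cluster $X_k$ and its image $Y_{\pi(k)}$. The
maximum is taken over all possible injections $\{\pi\}$. In
this way one recovers the maximum overlap between the
clusters of the two partitions. An alternative measure
is the normalized Van Dongen metric, defined as (van
Dongen, 2000b)

$$D(\mathcal{X}, \mathcal{Y}) = 1 - \frac{1}{2n} \left[ \sum_{k=1}^{n_X} \max_{k'} n_{kk'} + \sum_{k'=1}^{n_Y} \max_{k} n_{kk'} \right]. \qquad (92)$$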

A common problem of this type of measures is that some 
clusters may not be taken into account, if their overlap 
with clusters of the other partition is not large enough. 
Therefore if we compute the similarity of two partitions
$\mathcal{X}$ and $\mathcal{X}'$ with a partition $\mathcal{Y}$, with $\mathcal{X}$ and $\mathcal{X}'$ differing
from each other by a different subdivision of parts of the
graph that are not used to compute the measure, one
would obtain the same score.
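
Both cluster-matching measures can be computed from the table of overlaps $n_{kk'}$. The sketch below is a brute-force illustration: the injection in the classification error is found by enumerating all candidate mappings, which is only feasible for a handful of clusters, and the table is transposed when the first partition has more clusters than the second so that an injection exists. Function names are assumptions of this example.

```python
from itertools import permutations


def overlap_table(x, y):
    """table[k][k2] = number of vertices with label index k in x and k2 in y."""
    x_index = {c: k for k, c in enumerate(dict.fromkeys(x))}
    y_index = {c: k for k, c in enumerate(dict.fromkeys(y))}
    table = [[0] * len(y_index) for _ in x_index]
    for cx, cy in zip(x, y):
        table[x_index[cx]][y_index[cy]] += 1
    return table


def classification_error(x, y):
    """1 - (best total overlap over injective cluster matchings) / n."""
    table = overlap_table(x, y)
    if len(table) > len(table[0]):
        table = [list(col) for col in zip(*table)]  # fewer rows than columns
    n_rows, n_cols = len(table), len(table[0])
    best = max(
        sum(table[k][pi[k]] for k in range(n_rows))
        for pi in permutations(range(n_cols), n_rows)
    )
    return 1 - best / len(x)


def van_dongen(x, y):
    """Normalized Van Dongen metric: 0 for identical partitions."""
    table = overlap_table(x, y)
    row_max = sum(max(row) for row in table)
    col_max = sum(max(col) for col in zip(*table))
    return 1 - (row_max + col_max) / (2 * len(x))


x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 2, 2]
h = classification_error(x, y)
d = van_dongen(x, y)
```

Here the best matching pairs the two clusters of `x` with the first and third clusters of `y`, for a total overlap of 4 out of 6 vertices; the middle cluster of `y` is ignored by the matching, which is exactly the limitation discussed above. For larger numbers of clusters one would replace the enumeration with an assignment solver.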

The third class of similarity measures is based on re- 
formulating the problem of comparing partitions as a 
problem of message decoding within the framework of 
information theory (Mackay, 2003). The idea is that, if 
two partitions are similar, one needs very little informa- 
tion to infer one partition given the other. This extra 
information can be used as a measure of dissimilarity. 
To evaluate the Shannon information content (Mackay,
2003) of a partition, one starts by considering the community
assignments $\{x_i\}$ and $\{y_i\}$, where $x_i$ and $y_i$ indicate
the cluster labels of vertex $i$ in partition $\mathcal{X}$ and
$\mathcal{Y}$, respectively. One assumes that the labels $x$ and $y$
are values of two random variables $X$ and $Y$, with joint
distribution $P(x,y) = P(X = x, Y = y) = n_{xy}/n$,
which implies that $P(x) = P(X = x) = n_x^{\mathcal{X}}/n$ and
$P(y) = P(Y = y) = n_y^{\mathcal{Y}}/n$. The mutual information
$I(X,Y)$ of two random variables has been previously
defined [Eq. (70)], and can be applied as well to partitions
$\mathcal{X}$ and $\mathcal{Y}$, since they are described by random
variables. Actually $I(X,Y) = H(X) - H(X|Y)$, where
$H(X) = -\sum_x P(x) \log P(x)$ is the Shannon entropy of
$X$ and $H(X|Y) = -\sum_{x,y} P(x,y) \log P(x|y)$ is the conditional
entropy of $X$ given $Y$. The mutual information
is not ideal as a similarity measure: in fact, given a par- 
tition X , all partitions derived from X by further par- 
titioning (some of) its clusters would all have the same 
mutual information with X, even though they could be 
very different from each other. In this case the mutual in- 
formation would simply equal the entropy H(X), because 
the conditional entropy would be systematically zero. To 
avoid that, Danon et al. adopted the normalized mutual 
information (Danon et al, 2005) 



$$I_{norm}(\mathcal{X},\mathcal{Y}) = \frac{2 I(X,Y)}{H(X) + H(Y)}, \qquad (93)$$



which is currently very often used in tests of graph clustering
algorithms. The normalized mutual information
equals 1 if the partitions are identical, whereas it has
an expected value of 0 if the partitions are independent.
The measure, defined for standard partitions, in which
each vertex belongs to only one cluster, has been recently
extended to the case of overlapping clusters by Lancichinetti
et al. (Lancichinetti et al., 2009). The extension
is not straightforward, as the community assignments of
a partition are now specified by a vectorial random variable,
since each vertex may belong to several clusters
simultaneously. In fact, the definition by Lancichinetti et
al. is not a proper extension of the normalized mutual
information, in the sense that it does not recover exactly
the same value of the original measure for the comparison
of proper partitions without overlap, even though
the values are close.
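
The quantities above can be computed directly from the label counts. The following Python sketch is illustrative only: the function names and the representation of a partition as a list of labels are choices made here, and the value returned when both partitions consist of a single cluster (where the ratio is 0/0) is an assumed convention.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) = -sum_x P(x) log P(x), with P(x) = n_x / n."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X,Y) = sum_{x,y} P(x,y) log[P(x,y) / (P(x) P(y))], with P(x,y) = n_xy / n."""
    n = len(x)
    n_x, n_y, n_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        c / n * math.log(c * n / (n_x[a] * n_y[b])) for (a, b), c in n_xy.items()
    )


def normalized_mutual_information(x, y):
    """Normalized mutual information: 2 I(X,Y) / (H(X) + H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 1.0  # assumed convention: two single-cluster partitions are identical
    return 2 * mutual_information(x, y) / (h_x + h_y)
```

Comparing a partition with a refinement of itself illustrates the drawback of the unnormalized mutual information mentioned above: `mutual_information(x, r)` equals `entropy(x)` for any refinement `r` of `x`, while the normalized value drops below 1 because `entropy(r)` is larger.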

Meila (Meila, 2007) introduced the variation of infor- 
mation 



$$V(\mathcal{X},\mathcal{Y}) = H(X|Y) + H(Y|X), \qquad (94)$$



which has some desirable properties with respect to the
normalized mutual information and other measures. In
particular, it defines a metric in the space of partitions,
as it has the properties of a distance. It is also a local
measure, i. e. the similarity of partitions differing only
in a small portion of a graph depends on the differences of
the clusters in that region, and not on the partition of the
rest of the graph. The maximum value of the variation
of information is $\log n$, so similarity values for partitions
of graphs of different size cannot be compared with
each other. For meaningful comparisons one could divide
$V(\mathcal{X},\mathcal{Y})$ by $\log n$, as suggested by Karrer et al. (Karrer
et al., 2008).
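
A short Python sketch of Eq. (94), including the optional division by $\log n$, is given below. As before, partitions are assumed to be lists of cluster labels, and the function names are illustrative.

```python
import math
from collections import Counter


def conditional_entropy(x, y):
    """H(X|Y) = -sum_{x,y} P(x,y) log P(x|y), with P(x|y) = n_xy / n_y."""
    n = len(x)
    n_y = Counter(y)
    n_xy = Counter(zip(x, y))
    return -sum(c / n * math.log(c / n_y[b]) for (_, b), c in n_xy.items())


def variation_of_information(x, y, normalized=False):
    """Eq. (94): V = H(X|Y) + H(Y|X); optionally divided by log n (needs n > 1)."""
    v = conditional_entropy(x, y) + conditional_entropy(y, x)
    if normalized:
        v /= math.log(len(x))
    return v
```

The maximum $\log n$ is reached, for example, when one partition puts all vertices in a single cluster and the other puts each vertex in its own cluster; the normalized value is then 1 regardless of the graph size.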

A concept related to similarity is that of distance,
which indicates basically how many operations need to
be performed in order to transform one partition into another.
Gustafsson et al. defined two distance measures
for partitions (Gustafsson et al., 2006). They are both
based on the concept of the meet of two partitions, which is
defined as

$$\mathcal{M} = \bigcup_{i=1}^{n_A} \bigcup_{j=1}^{n_B} \left[ X_i \cap Y_j \right]. \qquad (95)$$

The distance measures are $m_{moved}$ and $m_{div}$. In both
cases they are determined by summing the distances of
$\mathcal{X}$ and $\mathcal{Y}$ from the meet $\mathcal{M}$. For $m_{moved}$ the distance of
$\mathcal{X}$ ($\mathcal{Y}$) from the meet is the minimum number of elements
that must be moved between $\mathcal{X}$ and $\mathcal{Y}$ so that $\mathcal{X}$ ($\mathcal{Y}$) and
$\mathcal{M}$ coincide (Gusfield, 2002). For $m_{div}$ the distance of $\mathcal{X}$
($\mathcal{Y}$) from the meet is the minimum number of divisions
that must be done in $\mathcal{X}$ ($\mathcal{Y}$) so that $\mathcal{X}$ ($\mathcal{Y}$) and $\mathcal{M}$ coincide
(Stanley, 1997). Such distance measures can easily
be transformed into similarity measures, like

$$I_{moved} = 1 - m_{moved}/n, \qquad I_{div} = 1 - m_{div}/n. \qquad (96)$$

Identical partitions have zero mutual distance and similarity
1 based on Eqs. (96).
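
The meet and the division-based distance can be sketched in a few lines of Python. The sketch rests on one reading of the definition that is an assumption made here: each division splits one cluster into two, so turning a partition with $c$ clusters into the meet with $c_{\mathcal{M}}$ clusters takes $c_{\mathcal{M}} - c$ divisions. Function names and the label-list representation are illustrative; $m_{moved}$ is not covered.

```python
def meet(x, y):
    """Meet of two partitions: the non-empty intersections of their clusters."""
    groups = {}
    for vertex, pair in enumerate(zip(x, y)):
        groups.setdefault(pair, set()).add(vertex)
    return list(groups.values())


def m_div(x, y):
    """Divisions needed to turn x into the meet, plus those needed for y.

    Assumption: one division splits one cluster into two, so a partition with
    c clusters needs (clusters in the meet - c) divisions.
    """
    n_meet = len(set(zip(x, y)))
    return (n_meet - len(set(x))) + (n_meet - len(set(y)))


def i_div(x, y):
    """Similarity derived from m_div as in Eq. (96)."""
    return 1 - m_div(x, y) / len(x)
```

For identical partitions the meet coincides with both of them, so `m_div` is 0 and `i_div` is 1.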

Finally an important problem is how to define the similarity
between clusters. If two partitions $\mathcal{X}$ and $\mathcal{Y}$ of a
graph are similar, each cluster of $\mathcal{X}$ will be very similar
to one cluster of $\mathcal{Y}$, and vice versa, and it is important
to identify the pairs of corresponding clusters. For instance,
if one has information about the time evolution
of a graph, one could monitor the dynamics of single clusters
as well, by keeping track of each cluster at different
time steps (Palla et al., 2007). Given clusters $X_i$ and
$Y_j$, their similarity can be defined through the relative
overlap $s_{ij}$

$$s_{ij} = \frac{|X_i \cap Y_j|}{|X_i \cup Y_j|}. \qquad (97)$$



FIG. 32 Relative performances of the algorithms listed in
Table I on the Girvan-Newman benchmark, for three values
of the expected average external degree $z_{out}$. Reprinted figure
with permission from Ref. (Danon et al., 2005). ©2005 by
IOP Publishing and SISSA.



In this way, looking for the cluster of $\mathcal{Y}$ corresponding to
$X_i$ means finding the cluster $Y_j$ that maximizes $s_{ij}$. The
index $s_{ij}$ can be used to define similarity measures for
partitions as well (Fan et al., 2007; Zhang et al., 2006).
An interesting discussion on the problem of comparing
partitions, along with further definitions of similarity
measures not discussed here, can be found in Ref. (Traud
et al., 2008).
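
The matching of clusters through the relative overlap can be sketched as follows. Clusters are assumed to be given as non-empty Python sets of vertices; `relative_overlap` and `best_matches` are names chosen here for illustration.

```python
def relative_overlap(a, b):
    """Relative overlap of two clusters: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)  # assumes at least one cluster is non-empty


def best_matches(clusters_x, clusters_y):
    """For each cluster of X, the index of the cluster of Y with the largest
    relative overlap, together with that overlap."""
    matches = {}
    for i, x_i in enumerate(clusters_x):
        j = max(
            range(len(clusters_y)),
            key=lambda j: relative_overlap(x_i, clusters_y[j]),
        )
        matches[i] = (j, relative_overlap(x_i, clusters_y[j]))
    return matches
```

Applied to snapshots of an evolving graph, a function of this kind would link each cluster at one time step to its most similar cluster at the next, which is the tracking use mentioned above. Ties are resolved in favor of the first cluster of Y encountered, which is an arbitrary choice of this sketch.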



C. Comparing algorithms 

The first systematic comparative analysis of graph 
clustering techniques has been carried out by Danon et 
al. (Danon et al., 2005). They compared the perfor- 
mances of various algorithms on the benchmark graphs 
by Girvan and Newman (Section XV. A). The algorithms 
examined are listed in Table I, along with their complex- 
ity. Fig. 32 shows the performance of all algorithms. 
Author                  Ref.                                               Label    Order
Eckmann & Moses         (Eckmann and Moses, 2002)                          EM       O(m⟨k^2⟩)
Zhou & Lipowsky         (Zhou and Lipowsky, 2004)                          ZL       O(n^3)
Latapy & Pons           (Latapy and Pons, 2005)                            LP       O(n^3)
Clauset et al.          (Clauset et al., 2004)                             NF       O(n log^2 n)
Newman & Girvan         (Newman and Girvan, 2004)                          NG       O(nm^2)
Girvan & Newman         (Girvan and Newman, 2002)                          GN       O(n^2 m)
Guimera et al.          (Guimera and Amaral, 2005; Guimera et al., 2004)   SA       parameter dependent
Duch & Arenas           (Duch and Arenas, 2005)                            DA       O(n^2 log n)
Fortunato et al.        (Fortunato et al., 2004)                           FLM      O(m^3 n)
Radicchi et al.         (Radicchi et al., 2004)                            RCCLP    O(m^4/n^2)
Donetti & Munoz         (Donetti and Munoz, 2004, 2005)                    DM/DMN   O(n^3)
Bagrow & Bollt          (Bagrow and Bollt, 2005)                           BB       O(n^3)
Capocci et al.          (Capocci et al., 2005)                             CSCC     O(n^2)
Wu & Huberman           (Wu and Huberman, 2004)                            WH       O(n + m)
Palla et al.            (Palla et al., 2005)                               PK       O(exp(n))
Reichardt & Bornholdt   (Reichardt and Bornholdt, 2004)                    RB       parameter dependent

TABLE I List of the algorithms used in the comparative analysis of Danon et al. (Danon et al., 2005). The first column
indicates the names of the algorithm designers, the second the original reference of the work, the third the symbol used to
indicate the algorithm and the last the computational complexity of the technique. Adapted from Ref. (Danon et al., 2005).

Author                  Ref.                                                 Label             Order
Girvan & Newman         (Girvan and Newman, 2002; Newman and Girvan, 2004)   GN                O(nm^2)
Clauset et al.          (Clauset et al., 2004)                               Clauset et al.    O(n log^2 n)
Blondel et al.          (Blondel et al., 2008)                               Blondel et al.    O(m)
Guimera et al.          (Guimera and Amaral, 2005; Guimera et al., 2004)     Sim. Ann.         parameter dependent
Radicchi et al.         (Radicchi et al., 2004)                              Radicchi et al.   O(m^4/n^2)
Palla et al.            (Palla et al., 2005)                                 Cfinder           O(exp(n))
Van Dongen              (Dongen, 2000a)                                      MCL               O(nk^2), k < n parameter
Rosvall & Bergstrom     (Rosvall and Bergstrom, 2007)                        Infomod           parameter dependent
Rosvall & Bergstrom     (Rosvall and Bergstrom, 2008)                        Infomap           O(m)
Donetti & Munoz         (Donetti and Munoz, 2004, 2005)                      DM                O(n^3)
Newman & Leicht         (Newman and Leicht, 2007)                            EM                parameter dependent
Ronhovde & Nussinov     (Ronhovde and Nussinov, 2009)                        RN                O(m^β log n), β ~ 1.3

TABLE II List of the algorithms used in the comparative analysis of Lancichinetti and Fortunato (Lancichinetti and Fortunato,
2009). The first column indicates the names of the algorithm designers, the second the original reference of the work, the third
the symbol used to indicate the algorithm and the last the computational complexity of the technique.

Instead of showing the whole curves of the similarity versus
$z_{out}$ (Section XV. A), which would display a fuzzy
picture with many strongly overlapping curves, difficult
to appreciate, Danon et al. considered three values for
$z_{out}$ (6, 7 and 8), and represented the result for each algorithm
as a group of three columns, indicating the average
value of the similarity between the planted partition and
the partition found by the method for each of the three
$z_{out}$-values. The similarity was measured in terms of the
fraction of correctly classified vertices (Section XV. A).
The comparison shows that modularity optimization via 
simulated annealing (Section VI. A. 2) yields the best re- 
sults, although it is a rather slow procedure, that cannot 
be applied to graphs of size of the order of 10^ vertices or 
larger. On the other hand, we have already pointed out 
that the benchmark by Girvan and Newman is not a good 
representation of real graphs with community structure, 
which are characterized by heterogeneous distributions 



of degree and community sizes. In this respect, the class 
of graphs designed by Lancichinetti et al. (LFR bench- 
mark) (Lancichinetti et al, 2008) (Section XV. A) poses 
a far more severe test to clustering techniques. For instance,
many methods have trouble detecting clusters
of very different sizes (like most methods listed in Table I).
For this reason, Lancichinetti and Fortunato have
carried out a careful comparative analysis of community
detection methods on the much more restrictive LFR
benchmark (Lancichinetti and Fortunato, 2009). The
algorithms chosen are listed in Table II. In Fig. 33 the
performances of the algorithms on the LFR benchmark are
compared. Whenever possible, tests on the versions of
the LFR benchmark with directed edges, weighted edges
and/or overlapping communities (Lancichinetti and Fortunato,
2009) were carried out. Lancichinetti and Fortunato
also tested the methods on random graphs, to
check whether they are able to notice the absence of
community structure. From the results of all tests, the
Infomap method by Rosvall and Bergstrom (Rosvall and
Bergstrom, 2008) appears to be the best, but the
algorithms by Blondel et al. (Blondel et al., 2008) and by
Ronhovde and Nussinov (Ronhovde and Nussinov, 2009)
also perform well. These three methods are also
very fast, with a complexity which is essentially linear in
the system size, so they can be applied to large systems.
On the other hand, modularity-based methods (with the
exception of the method by Blondel et al.) have a rather
poor performance, which worsens for larger systems and
smaller communities, due to the well known resolution
limit of modularity (Fortunato and Barthelemy, 2007).
The performance of the remaining methods worsens
considerably if one increases the system size (DM and
Infomod) or the community size (Cfinder, MCL and the method
by Radicchi et al.).

Fan et al. have evaluated the performance of some
algorithms to detect communities on weighted graphs (Fan
et al., 2007). The algorithms are: modularity maximization,
carried out with extremal optimization (WEO)
(Section VI.A.3); the Girvan-Newman algorithm (WGN)
(Section V.A); the Potts model algorithm by Reichardt
and Bornholdt (Potts) (Section VIII. A). All these techniques
were originally introduced for unweighted
graphs, but we have shown that they can easily be
extended to weighted graphs. The algorithms were tested
on the weighted version of the benchmark of Girvan
and Newman, which we discussed in Section XV. A. Edge
weights have only two values: $w_{inter}$ for inter-cluster
edges and $w_{intra}$ for intra-cluster edges. Such values are
linked by the relation $w_{intra} + w_{inter} = 2$, so they are
not independent. For testing one uses realizations of the
benchmark with fixed topology (i. e. fixed $z_{out}$) and
variable weights. In Fig. 34 the comparative performance of
the three algorithms is illustrated. The topology of the
benchmark graphs corresponds to $z_{out} = 8$, i. e. to graphs
in which each vertex has approximately the same number
of neighbors inside and outside its community. By
varying $w_{inter}$ from 0 to 2 one goes smoothly from a
situation in which most of the weight is concentrated inside
the clusters, to a situation in which instead the weight is
concentrated between the clusters. From Fig. 34 we see
that WEO and Potts are more reliable methods.

Sawardecker et al. have tested methods to detect
overlapping communities (Sawardecker et al., 2009).
They considered three algorithms: modularity optimization,
the Clique Percolation Method (CPM) (Section XI. A)
and the modularity landscape surveying
method by Sales-Pardo et al. (Sales-Pardo et al., 2007)
(Section XII. B). For testing, Sawardecker et al. defined
a class of benchmark graphs in which the linking probability
between vertices is an increasing function of the
number of clusters the vertices belong to. We have described
this benchmark in Section XV. A. It turns out
that the modularity landscape surveying method is able
to identify overlaps between communities, as long as the
fraction of overlapping vertices is small. Curiously, the
CPM, designed to find overlapping communities, has a
poor performance, as the overlapping vertices found by
the algorithm are in general different from the overlapping
vertices of the planted partition of the benchmark.
The authors also remark that, if the overlap between two
clusters is not too small, it may be hard (for any method)
to recognize whether the clusters are overlapping or
hierarchically organized, i. e. loosely connected clusters
within a large cluster.

FIG. 33 Performances of several algorithms on the LFR
benchmark (Lancichinetti and Fortunato, 2009). The plots
show the normalized mutual information (in the version proposed
in Ref. (Lancichinetti et al., 2009)) as a function of the
mixing parameter of the benchmark graphs. The different
curves for each method refer to different system sizes (1000
and 5000 vertices) and community size ranges [(S)=from 10
to 50 vertices, (B)=from 20 to 100 vertices]. For the GN
algorithm only the smaller graph size was adopted, due to the
high complexity of the method, whereas for the EM method
there are eight curves instead of four because for each set of
benchmark graphs the algorithm was run starting from two
different initial conditions. Reprinted figure with permission
from Ref. (Lancichinetti and Fortunato, 2009). ©2009 by the
American Physical Society.

FIG. 34 Comparative evaluation of the performances of
algorithms to find communities in weighted graphs. Tests are
carried out on a weighted version of the benchmark of Girvan
and Newman. The two plots show how good the algorithms
are in terms of the precision and accuracy with which they
recover the planted partition of the benchmark. Precision
indicates how close the values of similarity between the planted
and the model partition are after repeated experiments with
the same set of parameters; accuracy indicates how close the
similarity values are to the ideal result (1) after repeated
experiments with the same set of parameters. The similarity
measure adopted here is based on the relative overlap of clusters
of Eq. 97. We see that the maximization of modularity
with extremal optimization (WEO) and the Potts model
algorithm (Potts) are both precise and accurate as long as the
weight of the inter-cluster edges $w_{inter}$ remains lower than
the weight of the intra-cluster edges ($w_{inter} < 1$). Reprinted
figures with permission from Ref. (Fan et al., 2007). ©2007
by Elsevier.

We close the section with some general remarks con- 
cerning testing. We have seen that a testing procedure 
requires two crucial ingredients: benchmark graphs with 
built-in community structure and clustering algorithms 
that try to recover it. These two elements are not independent,
however, as they are both based on the concept
of community. If the underlying notions of community 
for the benchmark and the algorithm are very different, 
one can hardly expect that the algorithm will do a good 
job on the benchmark. Furthermore, there is a third el- 
ement, i. e. the quality of a partition. All benchmarks 
start from a situation in which communities are clearly 
identified, i. e. connected components of the graph, and 
introduce some amount of noise, that eventually leads 
to a scenario where clusters are hardly or no longer de- 
tectable. It is then important to keep track of how the 
quality of the natural partition of the benchmark worsens 
as the amount of noise increases, in order to distinguish 
configurations in which the graphs have a cluster struc- 
ture, that an algorithm should then be able to resolve, 
from configurations in which the noise prevails and the 
natural clusters are not meaningful. Moreover, quality 
functions are important to evaluate the performance of 
an algorithm on graphs whose community structure is 
unknown. Quality functions are strongly related to the 
concept of community as well, as they are supposed to 
evaluate the goodness of the clusters, so they require a 
clear quantitative concept of what a cluster is. It is very 
important for any testing framework to check for the mu- 
tual dependencies between the benchmark, the quality 
function used to evaluate partitions, and the clustering 
algorithm to be tested. This issue has so far received very 
little attention (Delling et al., 2007). Finally, empirical 
tests are also very important, as one ultimately wishes to 
apply clustering techniques to real graphs. Therefore, it 
is crucial to collect more data sets of graphs whose com- 
munity structure is known or deducible from information 
on the vertices and their edges. 



XVI. GENERAL PROPERTIES OF REAL CLUSTERS 



What are the general properties of partitions and clusters
of real graphs? Many papers on graph clustering present
applications to real systems. In spite of the
variety of clustering methods that one could employ, in
many cases partitions derived from different techniques
are rather similar to each other, so the general properties
of clusters do not depend much on the particular algorithm
used. The analysis of clusters and their properties
delivers a mesoscopic description of the graph, where the
communities, and not the vertices, are the elementary
units of the topology. The term mesoscopic is used because
the relevant scale here lies between the scale of the
vertices and that of the full graph.

FIG. 35 Cumulative distribution of community sizes for the
Amazon purchasing network. The partition is derived by
greedy modularity optimization. Reprinted figure with permission
from Ref. (Clauset et al., 2004). ©2004 by the American
Physical Society.

One of the first issues addressed was whether the communities
of a graph are usually about the same size or
whether the community sizes have some special distribution.
Most clustering techniques consistently find skewed
distributions of community sizes, with a tail described
with good approximation by a power law (at least over a
sizeable portion of the curve) with exponents in the range
between 1 and 3 (Clauset et al., 2004; Danon et al., 2007;
Newman, 2004a; Palla et al., 2005; Radicchi et al., 2004).
So, there seems to be no characteristic size for a community:
small communities usually coexist with large ones.
As an example, Fig. 35 shows the cumulative distribution
of community sizes for a recommendation network of the
online vendor Amazon.com. Vertices are products and
there is a connection between items A and B if B was
frequently purchased by buyers of A. Recall that the
cumulative distribution is the integral of the probability
distribution: if the cumulative distribution is a power law
$s^{-\alpha}$, the probability distribution is also a power law with
exponent $-(\alpha + 1)$.
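
The relation between the two exponents follows in one line. Writing $P_{>}(s)$ for the cumulative distribution, i. e. the probability that a community has size at least $s$ (a symbol introduced here for convenience), and treating $s$ as a continuous variable:

```latex
P_{>}(s) = \int_{s}^{\infty} p(s')\, ds' \propto s^{-\alpha}
\quad \Longrightarrow \quad
p(s) = -\frac{dP_{>}(s)}{ds} \propto \alpha\, s^{-(\alpha+1)}
```

For instance, a cumulative distribution decaying as $s^{-1}$ corresponds to a probability distribution decaying as $s^{-2}$.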

Leskovec et al. (Leskovec et al., 2008) have gone one 
step further. They carried out a systematic analysis of 
communities in large real networks, including traditional 
and on-line social networks, technological, information 



networks and web graphs. The main goal was to assess 
the quality of communities at various sizes. As a qual- 
ity function the conductance of the cluster was chosen. 
Recall that the conductance of a cluster is the ratio
between the cut size of the cluster and the minimum of
the total degree of the cluster and that of the rest
of the graph (Section IV. A). So, if the cluster is much
smaller than the whole graph, the conductance equals 
the ratio between the cut size and the total degree of 
the cluster. Since a "good" cluster is characterized by 
a low cut size and a large internal density of edges, low 
values of the conductance indicate good clusters. For 
each real network Leskovec et al. derived the network 
community profile plot (NCPP), showing the minimum 
conductance score among subgraphs of a given size as a 
function of the size. Interestingly, they found that the
NCPPs of all networks they studied have a characteristic
shape: they decrease up to subgraphs with
about 100 vertices, and then they rise monotonically for
larger subgraphs (Fig. 36). This seems to suggest that
communities are well defined only when they are fairly 
small in size. Such small clusters are weakly connected 
to the rest of the graph, often by a single edge (in this 
case they are called whiskers) , and form the periphery of 
the network. The other vertices form a big core, in which 
larger clusters are well connected to each other, and are 
therefore barely distinguishable (Fig. 36). Leskovec et 
al. performed low-conductance cuts with several meth- 
ods, to ensure that the result is not a simple artefact of 
a particular chosen technique. Moreover, they have also 
verified that, for large real networks with known com- 
munity structure (such as, e.g., the social network of the 
on-line blogging site LiveJournal, with its user groups),
the NCPP has the same qualitative shape if one takes the 
real communities instead of low-conductance subgraphs. 
The analysis by Leskovec et al. may shed new light on our 
understanding of community structure and its detection 
in large networks. The fact that the "best" communities 
appear to have a characteristic size of about 100 vertices 
is consistent with Dunbar's conjecture that 150 is the upper
size limit for a working human community (Dunbar,
1998). On the other hand, if large communities are very 
mixed with each other, as Leskovec et al. claim, they 
could hardly be considered communities, and the alleged 
"community structure" of large networks would be lim- 
ited to their peripheral region. The results by Leskovec 
et al. may be affected by the properties of conductance, 
and need to be validated with alternative approaches. In 
any case, whatever the value of the quality score of a clus- 
ter may be (low or high), it is necessary to estimate the 
significance of the cluster (Section XIV), before deciding 
whether it is a meaningful structure or not. 
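
The conductance used as quality function above can be computed directly from its definition. The sketch below assumes an undirected, unweighted graph without self-loops, stored as a dictionary mapping each vertex to the set of its neighbors; the function name and the data layout are illustrative choices, and no attempt is made to search for the minimum-conductance subgraph of a given size, which is the hard part of building an NCPP.

```python
def conductance(adj, cluster):
    """Cut size of the cluster divided by the smaller of the total degree of
    the cluster and the total degree of the rest of the graph."""
    cluster = set(cluster)
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    degree_in = sum(len(adj[u]) for u in cluster)
    degree_out = sum(len(adj[u]) for u in adj if u not in cluster)
    return cut / min(degree_in, degree_out)  # undefined if either side has no edges


# Two triangles joined by a single edge (hypothetical toy graph).
toy = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
```

On the toy graph, one triangle has a single edge leaving it and total degree 7, so its conductance is 1/7, whereas a single vertex of the triangle, all of whose edges leave the "cluster", has conductance 1.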

FIG. 36 Analysis of communities in large real networks by
Leskovec et al. (Leskovec et al., 2008). (Left) Typical shape
of the network community profile plot (NCPP), showing how
the minimum conductance of subgraphs of size n varies with
n. The plot indicates that the "best" communities have a size
of about 100 vertices (minimum of the curve), whereas communities
of larger sizes are not well-defined. In the plot two
other NCPPs are shown: the one labeled Rewired network corresponds
to a randomized version of the network, where edges
are randomly rewired by keeping the degree distribution; the
one labeled Bag of whiskers gives the minimum conductance
scores of clusters composed of disconnected pieces. (Right)
Scheme of the core-periphery structure of large social and
information networks derived by Leskovec et al. based on the
results of their empirical analysis. Most of the vertices are in
a central core, which does not have a clear community structure,
whereas the best communities, which are rather small,
are weakly connected to the core. Reprinted figure with permission
from Ref. (Leskovec et al., 2008).

FIG. 37 Regions of the z-P plane defining the roles of
vertices in the modular structure of a graph, according to the
scheme of Guimera and Amaral (Guimera and Amaral, 2005;
Guimera and Amaral, 2005). Reprinted figure with permission
from Ref. (Guimera and Amaral, 2005). ©2005 by the
Nature Publishing Group.

If the community structure of a graph is known, it
is possible to classify vertices according to their roles
within their community, which may allow one to infer individual
properties of the vertices. A promising classification
has been proposed by Guimera and Amaral (Guimera
and Amaral, 2005; Guimera and Amaral, 2005). The
role of a vertex depends on the values of two indices, the
within-module degree and the participation ratio (though
other variables may be chosen, in principle). The within-module
degree $z_i$ of vertex $i$ is defined as

$$z_i = \frac{\kappa_i - \bar{\kappa}_{s_i}}{\sigma_{\kappa_{s_i}}}, \qquad (98)$$

where $\kappa_i$ is the internal degree of $i$ in its cluster $s_i$, and
$\bar{\kappa}_{s_i}$ and $\sigma_{\kappa_{s_i}}$ are the average and standard deviation of the internal
degrees for all vertices of cluster $s_i$. The within-module
degree is thus the z-score of the internal degree
$\kappa_i$. Large values of $z$ indicate that the vertex has many
more neighbors within its community than most other
vertices of the community. Vertices with $z > 2.5$ are
classified as hubs; if $z < 2.5$ they are non-hubs. The
participation ratio $P_i$ of vertex $i$ is defined as

$$P_i = 1 - \sum_{s=1}^{n_c} \left( \frac{\kappa_{is}}{k_i} \right)^2. \qquad (99)$$

Here $\kappa_{is}$ is the internal degree of $i$ in cluster $s$, $k_i$ the
degree of $i$. Values of $P$ close to 1 indicate that the
neighbors of the vertex are uniformly distributed among
all clusters; if all neighbors are within the cluster of the
vertex, instead, $P = 0$. Based on the values of the
pair $(z, P)$, Guimera and Amaral distinguished seven
roles for the vertices. Non-hub vertices can be ultra-peripheral
($P \approx 0$), peripheral ($P < 0.625$), connectors
($0.625 < P < 0.8$) and kinless vertices ($P > 0.8$). Hub
vertices are classified in provincial hubs ($P \leq 0.3$),
connector hubs ($0.3 < P < 0.75$) and kinless hubs
($P > 0.75$). The regions of the z-P plane corresponding
to the seven roles are highlighted in Fig. 37. We
stress that the actual boundaries of the regions can be
chosen rather arbitrarily. On graphs without community
structure, like Erdos-Renyi (Erdos and Renyi, 1959)
random graphs and Barabasi-Albert (Barabasi and Albert,
1999) graphs (Section A. 3), non-hubs are mostly
kinless vertices. In addition, if there are hubs, like in
Barabasi-Albert graphs, they are kinless hubs. Kinless
hubs (non-hubs) have fewer than half (one third) of
their neighbors inside any cluster, so they are not clearly
associated to a cluster. On real graphs, the topological
roles can be correlated to functions of vertices: in
metabolic networks, for instance, connector hubs, which
share most edges with vertices of clusters other than their
own, are often metabolites which are more conserved
across species than other metabolites, i. e. they have
an evolutionary advantage (Guimera and Amaral, 2005).
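
The two indices and the resulting roles can be sketched in Python as follows. Several choices in this sketch are assumptions made here rather than statements of the original scheme: the graph is a dictionary of neighbor sets, the standard deviation is the population one, $z$ is set to 0 when all internal degrees of a cluster coincide, vertices exactly on a boundary are assigned as coded, and "P close to 0" for ultra-peripheral vertices is implemented through an adjustable cutoff `ultra_peripheral_max`.

```python
import math
from collections import Counter


def z_and_p(adj, membership):
    """Within-module degree z_i and participation ratio P_i of every vertex.

    adj: dict vertex -> set of neighbors; membership: dict vertex -> cluster.
    """
    kappa = {
        v: sum(1 for u in adj[v] if membership[u] == membership[v]) for v in adj
    }
    per_cluster = {}
    for v, s in membership.items():
        per_cluster.setdefault(s, []).append(kappa[v])
    stats = {}
    for s, values in per_cluster.items():
        mean = sum(values) / len(values)
        std = math.sqrt(sum((k - mean) ** 2 for k in values) / len(values))
        stats[s] = (mean, std)
    z, p = {}, {}
    for v in adj:
        mean, std = stats[membership[v]]
        z[v] = (kappa[v] - mean) / std if std > 0 else 0.0  # assumed convention
        k = len(adj[v])
        links = Counter(membership[u] for u in adj[v])  # links of v to each cluster
        p[v] = 1 - sum((c / k) ** 2 for c in links.values()) if k > 0 else 0.0
    return z, p


def role(z, p, hub_threshold=2.5, ultra_peripheral_max=0.05):
    """Role of a vertex in the z-P plane, using the boundaries quoted in the text."""
    if z >= hub_threshold:
        if p <= 0.3:
            return "provincial hub"
        return "connector hub" if p < 0.75 else "kinless hub"
    if p <= ultra_peripheral_max:
        return "ultra-peripheral"
    if p < 0.625:
        return "peripheral"
    return "connector" if p < 0.8 else "kinless"
```

Since the text stresses that the boundaries of the regions are rather arbitrary, the thresholds are best regarded as parameters to be tuned for the graph at hand.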

If communities are overlapping, one can explore other
statistical properties, like the distributions of the overlaps
and of the vertex memberships. The overlap is defined
as the number of vertices shared by each pair of
overlapping clusters; the membership of a vertex is the
number of communities including the vertex. Both distributions
turn out to be skewed, so there seem to be no
characteristic values for the overlap and the membership.
Moreover, one could derive a network, where the communities
are the vertices and pairs of vertices are connected
if their corresponding communities overlap (Palla et al.,
2005). Such networks seem to have some special properties.
For instance, the degree distribution is a particular
function, with an initial exponential decay followed by
a slower power law decay. This holds for the networks
considered by Palla et al. (Palla et al., 2005) like, e. g., the
word association network (Section II) and a coauthorship
network of physicists; there is no a priori reason
to believe that this result is general. We stress that the above
results have been obtained with the Clique Percolation
Method by Palla et al. (Section XI. A) and it is not clear
whether other techniques would confirm them or not. In
a recent analysis it has been shown that the degree distribution
of the network of communities can be reproduced
by assuming that the graph grows according to a
simple preferential attachment mechanism, where communities
with large degree have an enhanced chance to
interact/overlap with new communities (Pollner et al.,
2006).

XVII. APPLICATIONS ON REAL-WORLD NETWORKS

The ultimate goal of clustering algorithms is to infer
properties of and relationships between
vertices that are not available from direct
observation/measurement. If the scientific community agrees on
a set of reliable techniques, one could then proceed with
careful investigations of systems in various domains. So
far, most works in the literature on graph clustering have
focused on the development of new algorithms, and applications
were limited to those few benchmark graphs that
one typically uses for testing (Section XV. A). Still, there
are also applications aiming at understanding real systems.
Some results have actually been mentioned in the
previous sections. This section is supposed to give a flavor
of what can be done by using clustering algorithms.
Therefore, the list of works presented here is by no means
exhaustive. Most studies focus on biological and social
networks. We mention a few applications to other types
of networks as well.



A. Biological networks 

The recent abundance of genomic data has allowed us 
to explore the cell at an unprecedented depth. A wealth 
of information is available on interactions involving proteins
and genes, metabolic processes, etc. Cellular systems are
regularly studied through their graph representation.
Protein-protein interaction networks (PIN),
gene regulatory networks (GRN) and metabolic networks
(MN) have meanwhile become standard objects of investigation in
biology and bioinformatics (Junker and Schreiber, 2008).

Biological networks are characterized by a remarkable 
modular organization, reflecting functional associations 
between their components. For instance, proteins tend 
to be associated in two types of cellular modules: protein 
complexes and functional modules. A protein complex is
a group of proteins that mutually interact at the same
time and place, forming a sort of physical object.
Examples are transcription factor complexes, protein
transport and export complexes, etc. Functional modules
instead are groups of proteins taking part in the same
cellular process, even if the interactions may happen at
different times and places. Examples are the CDK/cyclin
module, responsible for cell-cycle progression, the yeast
pheromone response pathway, etc. Identifying cellular
modules is fundamental to uncover the organization and 
dynamics of cell functions. However, the information on 
cell units (e. g. proteins, genes) and their interactions is 
often incomplete, or even incorrect, due to noise in the 
data produced by the experiments. Therefore, inferring 
modules from the topology of cellular networks enables 
one to restrict the set of possible scenarios and can be a 
safe guide for future experiments. 

Rives and Galitski (Rives and Galitski, 2003) studied
the modular organization of a subset of the PIN
of the yeast (Saccharomyces cerevisiae), consisting of
the (signaling) proteins involved in the processes leading
the microorganism to a filamentous form. The clusters
were detected with a hierarchical clustering technique.
Proteins mostly interacting with members of their
own cluster are often essential proteins; edges between 
modules are important points of communication. Spirin 
and Mirny (Spirin and Mirny, 2003) identified protein 
complexes and functional modules in yeast with different 
techniques: clique detection, superparamagnetic cluster- 
ing (Blatt et al., 1996) and optimization of cluster edge 
density. They estimated the statistical significance of the 
clusters by computing the p-values of seeing those clusters
in random graphs with the same expected degree
sequence as the original network. From the known func- 
tional annotations of yeast genes one can see that the 
modules usually group proteins with the same or con- 
sistent biological functions. Indeed, in many cases, the 
modules exactly coincide with known protein complexes. 
The results appear robust if noise is introduced in the 
system, to simulate the noise present in the experimental 
data. Functional modules in yeast were also found by 
Chen and Yuan (Chen and Yuan, 2006), who applied the 
algorithm by Girvan and Newman with a modified definition
of edge betweenness (Section V.A). The standard
Girvan-Newman algorithm has proved reliable in
detecting functional modules in PINs (Dunn et al., 2005).
The novelty of the work by Chen and Yuan is its focus 
on weighted PINs, where the weights come from infor- 
mation derived through microarray expression profiles. 
Weights add information about the system and should 
lead to a more reliable modular structure. When genes in
the same structural cluster were knocked out, similar
phenotypes appeared, suggesting that the genes have similar
biological roles. Moreover, the clusters often contained
known protein complexes, either entirely or to a large 
extent. Finally, Chen and Yuan were able to make pre- 
dictions of the unknown function of some genes, based on 
the structural module they belong to: gene function pre- 
diction is the most promising outcome deriving from the 
application of clustering techniques to PINs. Farutin et
al. (Farutin et al., 2006) have adopted a local concept of
community, and derived a hierarchical decomposition of
PINs, in which the modules identified at some level become
the vertices of a network at the next higher level.
Communities are overlapping, to account for the fact that
proteins (and whole modules) may have diverse biological
functions. High level structures detected in a human PIN
correspond to general biological concepts like signal
transduction, regulation of gene expression, intercellular
communication. Sen et al. (Sen et al., 2006) identified protein
clusters for yeast from the eigenvectors of the Laplacian
matrix (Section A.2), computed via Singular Value
Decomposition. In a recent analysis, Lewis et al. (Lewis
et al., 2009) carefully explored the relationship between
structural communities of PINs and their biological
function. Communities were detected with the multiresolution
approach by Reichardt and Bornholdt (Reichardt and
Bornholdt, 2006a) (Section VI.B). A community is
considered biologically homogeneous if the functional
similarity between protein pairs of the community (extracted
through the Gene Ontology database (Ashburner et al., 
2000)) is larger than the functional similarity between all 
protein pairs of the network. Lewis et al. also specialized
the comparison to interacting and non-interacting protein
pairs. As a result, many communities turn out to
be biologically homogeneous, especially if they are not
too small. Moreover, some topological attributes of
communities, like the within-community clustering coefficient
(i.e. the average value of the clustering coefficients of
the vertices of a community, computed by considering
just the neighbors belonging to the community) and link
density (density of internal edges), are good indicators
of biological homogeneity: the former is strongly
correlated with biological homogeneity, independently of the
community size, whereas for the latter the correlation is
strong for large communities.
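
The two topological indicators just described can be made concrete with a minimal sketch. It assumes an undirected graph stored as a dictionary of neighbor sets; the function names, the toy graph and the convention that a vertex with fewer than two internal neighbors contributes zero are illustrative assumptions, not taken from Lewis et al.

```python
from itertools import combinations

def within_community_clustering(adj, community):
    """Average clustering coefficient over the vertices of `community`,
    counting only neighbors that also belong to the community."""
    members = set(community)
    coefficients = []
    for v in members:
        inside = adj[v] & members            # neighbors inside the community
        k = len(inside)
        if k < 2:
            coefficients.append(0.0)         # assumed convention for k < 2
            continue
        # number of edges among the internal neighbors of v
        links = sum(1 for a, b in combinations(inside, 2) if b in adj[a])
        coefficients.append(2.0 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)

def link_density(adj, community):
    """Fraction of the possible internal edges that are actually present."""
    members = set(community)
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(len(adj[v] & members) for v in members) // 2
    return 2.0 * internal / (n * (n - 1))

# Hypothetical toy graph: triangle 0-1-2, vertex 3 attached to 2,
# and vertex 4 attached to 0 but lying outside the community.
adj = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: {0}}
community = [0, 1, 2, 3]
print(within_community_clustering(adj, community))
print(link_density(adj, community))
```

Vertex 4 is ignored in both measures because it does not belong to the community, which is the restriction expressed in the definition above.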

Metabolic networks have also been extensively
investigated. We have already discussed the "functional
cartography" designed by Guimerà and Amaral (Guimerà and
Amaral, 2005; Guimerà and Amaral, 2005), which applies
to general types of networks, not necessarily metabolic.
A hierarchical decomposition of metabolic networks has
been derived by Holme et al. (Holme et al., 2003), by
using a hierarchical clustering technique inspired by the
algorithm by Girvan and Newman (Section V.A). Here,
vertices are removed based on their betweenness values,
which are obtained by dividing the standard site
betweenness scores (Freeman, 1977) by the indegree of
the respective vertices. A picture of metabolic networks
emerges, in which there are core clusters centered at
hub-substances, surrounded by outer shells of less
connected substances, and a few other clusters at
intermediate scales. In general, clusters at different scales seem
to be meaningful, so the whole hierarchy should be taken
into account.

Wilkinson and Huberman (Wilkinson and Huberman,
2004) analyzed a network of gene co-occurrence
to find groups of related genes. The network is built
by connecting pairs of genes that are mentioned
together in the abstracts of articles of the Medline database
(http://medline.cos.com/). Clusters were found with
a modified version of the algorithm by Girvan and
Newman, in which edge betweenness is computed by
considering the shortest paths of a small subset of all vertex
pairs, to save computer time (Section V.A). As a result,
genes belonging to the same cluster turn out to be
functionally related to each other. Co-occurrence of terms is
also used to extract associations between genes and
diseases, to find out which genes are relevant for a specific
disease. Communities of genes related to colon cancer
can be helpful to identify the function of the genes.
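
As an illustration of how such a co-occurrence network can be assembled, the sketch below links two genes whenever both names appear in the same abstract. The abstracts, the gene list and the crude tokenization are hypothetical, and keeping a count per pair (rather than a plain unweighted edge) is an assumption made here for convenience.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(abstracts, gene_names):
    """Return a Counter mapping gene pairs to the number of abstracts
    in which both genes are mentioned."""
    counts = Counter()
    for text in abstracts:
        # crude tokenization: strip commas and periods, split on whitespace
        words = set(text.replace(",", " ").replace(".", " ").split())
        mentioned = sorted(words & gene_names)
        for pair in combinations(mentioned, 2):
            counts[pair] += 1
    return counts

# Hypothetical input data
abstracts = [
    "BRCA1 and TP53 interact.",
    "TP53 regulates MDM2, BRCA1 too.",
    "MDM2 alone.",
]
genes = {"BRCA1", "TP53", "MDM2"}
edges = cooccurrence_edges(abstracts, genes)
print(edges)
```

The resulting pairs are the edges of the graph on which a clustering algorithm would then be run.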



B. Social networks 

Networks depicting social interactions between people
have been studied for decades (Scott, 2000; Wasserman
and Faust, 1994). Recently, modern Information and
Communication Technology (ICT) has opened new
interaction modes between individuals, like mobile phone
communications and online interactions enabled by the
Internet. Such new social exchanges can be accurately
monitored for very large systems, including millions of
individuals, whose study represents a huge opportunity
for social science. Communities of social networks can
be friendship circles, groups of people sharing common
interests and/or activities, etc.

Blondel et al. have analyzed a network of mobile phone
communications between users of a Belgian phone
operator (Blondel et al., 2008). The graph has 2.6 million
vertices and the edges are weighted by the cumulative
duration of phone calls between users in the observation
time frame. The clustering analysis, performed with a
fast hierarchical modularity optimization technique
developed by the authors (discussed in Section VI.A.1),
delivers six hierarchical levels. The highest level consists
of 261 groups with more than 100 vertices, which are
clearly arranged in two main groups, linguistically
homogeneous, reflecting the linguistic split of the Belgian
population (Fig. 38). Tyler et al. (Tyler et al., 2003) studied
a network of e-mail exchanges between people working
at HP Labs. They applied the same modified version of the
Girvan-Newman algorithm that two of the authors have
used to find communities of related genes (Wilkinson and
Huberman, 2004) (Section XVII.A). The method enables
one to measure the degree of membership of each vertex
in a community and allows for overlaps between
communities. The detected clusters matched quite closely
the organization of the Labs in departments and project
groups, as confirmed by interviews conducted with
researchers.

FIG. 38 Community structure of a social network of mobile
phone communication in Belgium. Dots indicate
subcommunities at the lower hierarchical level (with more than 100
people) and are colored in a red-green scale to represent the level
of representation of the two main languages spoken in Belgium
(red for French and green for Dutch). Communities of the two
larger groups are linguistically homogeneous, with more than
85% of people speaking the same language. Only one
community (zoomed), which lies at the border between the two main
aggregations, has a more balanced distribution of languages.
Reprinted figure with permission from Ref. (Blondel et al.,
2008). ©2008 by IOP Publishing and SISSA.

FIG. 39 Communities in social networking sites. (Top)
Visualization of a network of friendships between students at
Caltech, constructed from Facebook data (September 2005).
The colors/shapes indicate the dormitories (Houses) of the
students. (Bottom) Topological communities of the network,
which are quite homogeneous with respect to House
affiliation. Reprinted figures with permission from Refs. (Porter
et al., 2009) and (Traud et al., 2008).

Social networking sites, like Myspace
(www.myspace.com), Friendster (www.friendster.com),
Facebook (www.facebook.com), etc. have become
extremely popular in recent years. They are online
platforms that allow people to communicate with friends,
send e-mails, solicit opinions on specific issues, spread
ideas and/or fads, etc. Traud et al. (Traud et al., 2008)
used anonymous Facebook data to create networks of
friendships between students of different American
universities, where vertices/students are connected if they
are friends on Facebook. Communities were detected by
applying a variant of Newman's spectral optimization
of modularity (Section VI.A.4); the results were further
refined through additional steps à la Kernighan-Lin
(Section IV.A). One of the goals of the study was to
infer relationships between the online and offline lives
of the students. By using demographic information on
the students' populations, one finds that communities
are organized by class year or by House (dormitory)
affiliation, depending on the university (Fig. 39). Yuta
et al. (Yuta et al., 2007) observed a gap in the
community size distribution of a friendship network extracted
from mixi (mixi.jp), the largest social networking site
in Japan (as of December 2006). Communities were
identified with the fast greedy modularity optimization
by Clauset et al. (Clauset et al., 2004). The gap occurs
in the intermediate range of sizes between 20 and 400,
where only a few communities are observed. Yuta et al.
introduced a model where people form new friendships
both by "closing" ties with people who are friends of
friends, and by setting new links with individuals having
similar interests. In this way most groups turn out to be
either small or large, and medium size groups are rare.

Collaboration networks, in which individuals are linked
if they are (were) involved in a common activity, have
often been studied because they embed an implicit
objective concept of acquaintance, which is not easy to
capture in direct social experiments/interviews. For
instance, somebody may consider another individual a
friend, while the latter may disagree. A collaboration
instead is a proof of a social relationship between
individuals. The analysis of the structure of scientific
collaboration networks (Newman, 2001) has exerted a big
influence on the development of modern network science.
Scientific collaboration is associated with coauthorship: two
scientists are linked if they have coauthored at least one
paper. Information about coauthorships can
be extracted from different databases of research papers.
Communities indicate groups of people with common
research interests, i. e. topical or disciplinary groups. In
the seminal paper by Girvan and Newman (Girvan and
Newman, 2002), the authors applied their method to a
collaboration network of scientists working at the Santa
Fe Institute, and were able to discriminate between
research divisions (Fig. 2b). The community structure of
scientific collaboration networks has been investigated by
many authors (Danon et al., 2006; Donetti and Muñoz,
2004; Duch and Arenas, 2005; Farkas et al., 2007;
Gregory, 2007; Lehmann and Hansen, 2007; Nepusz et al.,
2008; Newman, 2004b, 2006a; Noack and Rotta, 2009;
Palla et al., 2007, 2005; Pujol et al., 2006; Radicchi et al.,
2004; Reichardt and Bornholdt, 2006a; Richardson et al.,
2009; S.-W. Son et al., 2006; Shen et al., 2009; Vragović
and Louis, 2006; White and Smyth, 2005; Zhou, 2003b).
Other types of collaboration networks have been studied
too. Gleiser and Danon (Gleiser and Danon, 2003)
considered a collaboration network of jazz musicians.
Vertices are either musicians, connected if they played in the
same band, or bands, connected if they have a musician
in common. By applying the algorithm of Girvan and
Newman they found that communities reflect both racial
segregation (with two main groups comprising only black
or white players) and geographical separation, due to the
different recording locations.



C. Other networks 

Citation networks (de Solla Price, 1965) have been
regularly used to understand the citation patterns of authors
and to disclose relationships between disciplines. Rosvall
and Bergstrom (Rosvall and Bergstrom, 2008) used a
citation network of over 6000 scientific journals to derive a
map of science. They used a clustering technique based
on compressing the information on random walks taking
place on the graph (Section IX.B). A random walk
follows the flow of citations from one field to another, and
the fields emerge naturally from the clustering analysis
(Fig. 40). The structure of science resembles the letter U,
with the social sciences and engineering at the terminals,
joined through a chain including medicine, molecular
biology, chemistry and physics.

Reichardt and Bornholdt (Reichardt and Bornholdt,
2007) performed a clustering analysis on a network built
from bidding data taken from the German version of
Ebay (www.ebay.de), the most popular online auction
site. The vertices are bidders and two vertices are
connected if the corresponding bidders have expressed
interest in the same item. Clusters were detected with
the multiresolution modularity optimization developed
by the authors themselves (Reichardt and Bornholdt,
2006a) (Section VI.B). In spite of the variety of items
that it is possible to purchase through Ebay, about 85% of
bidders were classified into a few major clusters,
reflecting bidders' broad categories of interests. Ebay data were
also examined by Jin et al. (Jin et al., 2007), who
considered bidding networks where the vertices are the
individual auctions and edges are placed between auctions
having at least one common bidder. Communities, detected
with greedy modularity optimization (Newman, 2004b)
(Section VI.A.1), make it possible to identify substitute goods, i. e.
products that have value for the same bidder, so that
they can be purchased together or alternatively.

FIG. 40 Map of science derived from a clustering analysis of a citation network comprising more than 6000 journals. Reprinted
figure with permission from Ref. (Rosvall and Bergstrom, 2008). ©2008 by the National Academy of Science of the USA.

Legislative networks enable one to deduce associations
between politicians through their parliamentary activity,
which may be related or not to party affiliation. Porter
and coworkers have carried out numerous studies on the
subject (Porter et al., 2007, 2005; Zhang et al., 2008),
by using data on the Congress of the United States. In
Refs. (Porter et al., 2007, 2005), they examined the
community structure of networks of committees in the US
House of Representatives. Committees sharing common
members are connected by edges, which are weighted by
dividing the number of common members by the
number one would expect to have if committee memberships
were randomly assigned. Hierarchical clustering
(Section IV.B) reveals close connections between some of
the committees. In another work (Zhang et al., 2008),
Zhang et al. analyzed networks of legislation
cosponsorship, in which vertices are legislators and two
legislators are linked if they support at least one common bill.
Communities, identified with a modification of Newman's
spectral optimization of modularity (Section VI.A.4), are
correlated with party affiliation, but also with geography
and committee memberships of the legislators.
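
The weighting of committee edges described above can be sketched as follows. The sketch assumes that "randomly assigned" means that each committee keeps its size and draws its members uniformly from a pool of N legislators, so that the expected overlap of committees of sizes a and b is a·b/N; this reading, the function name and the toy data are assumptions made for illustration, not a reproduction of the procedure of Porter et al.

```python
def normalized_overlap(members_a, members_b, n_legislators):
    """Observed number of common members divided by the number expected
    under uniform random assignment with fixed committee sizes."""
    a, b = set(members_a), set(members_b)
    common = len(a & b)
    expected = len(a) * len(b) / n_legislators   # mean of a hypergeometric draw
    return common / expected

# Hypothetical committees drawn from a chamber of 20 legislators
committee_a = {1, 2, 3, 4}
committee_b = {3, 4, 5, 6, 7}
weight = normalized_overlap(committee_a, committee_b, 20)
print(weight)
```

A weight above 1 indicates that two committees share more members than chance alone would produce, a weight below 1 that they share fewer.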

Networks of correlations between time series of stock
returns have received growing attention in the past few
years (Mantegna, 1999). In early studies, scholars found
clusters of correlated stocks by computing the maximum
spanning tree of the network (Bonanno et al., 2003, 2000;
Onnela et al., 2003, 2002) (Section A.1), and realized
that such clusters match quite well the economic sectors
of the stocks. More recently, the community structure of
the networks has been investigated by means of proper
clustering algorithms. Farkas et al. (Farkas et al., 2007)
have applied the weighted version of the Clique
Percolation Method (Section XI.A) and found that the presence
of two strong (i. e. carrying high correlation) edges in
triangles is usually accompanied by the presence of a strong
third edge. Heimo et al. (Heimo et al., 2008) used the
weighted version of the multiresolution method by
Reichardt and Bornholdt (Reichardt and Bornholdt, 2006a)
(Section VI.B). Clusters correspond to relevant business
sectors, as indicated by Forbes classification; moreover,
smaller clusters at lower hierarchical levels seem to
correspond to (economically) meaningful substructures of the
main clusters.
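
The maximum spanning tree used in the early studies mentioned above can be computed with Kruskal's algorithm applied to edges sorted by decreasing weight. The sketch below is a generic implementation with a union-find structure; the correlation values are invented for illustration and the cited works may have used a different procedure.

```python
def maximum_spanning_tree(n_vertices, weighted_edges):
    """Kruskal's algorithm on edges given as (weight, u, v) triples,
    processed from the largest weight to the smallest."""
    parent = list(range(n_vertices))

    def find(x):
        # root of the component of x, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                 # the edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Hypothetical correlations between the returns of five stocks
correlations = [
    (0.90, 0, 1), (0.80, 1, 2), (0.70, 0, 2),
    (0.20, 2, 3), (0.85, 3, 4), (0.10, 1, 4),
]
tree = maximum_spanning_tree(5, correlations)
print(tree)
```

In this toy example the tree keeps the strongly correlated groups {0, 1, 2} and {3, 4} joined by a single weak edge; removing the weakest tree edges is one simple way to read clusters off the tree.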



XVIII. OUTLOOK 

Despite its remote origins and its great popularity in
recent years, research on graph clustering has not yet
given a satisfactory solution to the problem and leaves
us with a number of important open issues. From our
exposition it appears that the field has grown in a rather 
chaotic way, without a precise direction or guidelines. In 
some cases, interesting new ideas and tools have been pre- 
sented, in others existing methods have been improved, 
becoming more accurate and/or faster. 

What the field lacks the most is a theoretical frame- 
work that defines precisely what clustering algorithms 
are supposed to do. Everybody has his/her own idea of 
what a community is, and most ideas are consistent with 
each other, but, as long as there is still disagreement, it 
remains impossible to decide which algorithm does the 
best job and there will be no control on the creation of 
new methods. Therefore, we believe that the first and 
foremost task that the scientific community working on 
graph clustering has to solve in the future is defining a set 
of reliable benchmark graphs, against which algorithms 
should be tested (Section XV. A). Defining a benchmark 
goes far beyond the issue of testing. It means designing 
practical examples of graphs with communities, and, in 
order to do that, one has to agree on the fundamental 
concepts of community and partition. Clustering algo- 
rithms have to be devised consistently with such defini- 
tions, in order to give the best performance on the set of 
designated benchmarks, which represent a sort of ground 
truth. The explosion in the number of algorithms we have 
witnessed in recent times is due precisely to the present 
lack of reliable mechanisms of control of their quality 
and comparison of their performances. If the commu- 
nity agrees on a benchmark, the future development of 
the field will be more coherent and the progress brought 
by new methods can be evaluated in an unbiased manner.
The planted ℓ-partition model (Condon and Karp,
2001) is the easiest recipe one can think of when it comes
to defining clusters, and is the criterion underlying
well-known benchmarks, like that by Girvan and Newman.
We believe that the new benchmarks have to be defined
along the same lines. The benchmark graphs recently
introduced by Lancichinetti et al. (Lancichinetti and
Fortunato, 2009; Lancichinetti et al., 2008) and by Sawardecker
et al. (Sawardecker et al., 2009) are an important step in
this direction.
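
A graph from the planted ℓ-partition model can be generated in a few lines: vertices are split into ℓ equal groups, and each pair is linked with probability p_in if the two vertices share a group and p_out otherwise. The sketch below is a plain, quadratic-time illustration; the function name, the parameter names and the chosen values are assumptions.

```python
import random

def planted_partition(n_groups, group_size, p_in, p_out, seed=None):
    """Return (group, edges): the group label of every vertex and the list
    of undirected edges of one realization of the model."""
    rng = random.Random(seed)
    n = n_groups * group_size
    group = [v // group_size for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if group[u] == group[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return group, edges

# Four groups of 32 vertices, the sizes used in the benchmark by
# Girvan and Newman; the probabilities are illustrative.
group, edges = planted_partition(4, 32, p_in=0.3, p_out=0.02, seed=1)
print(len(group), len(edges))
```

The planted groups provide the "natural" partition against which the output of a clustering algorithm can be compared; the task becomes harder as p_out approaches p_in.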

Defining a benchmark implies specifying the "natural" 
partition of a graph, the one that any algorithm should 
find. This issue in turn involves the concept of quality 
of a partition, that has characterized a large part of the
development of the field, in particular after the
introduction of Newman-Girvan modularity (Section III.C.2).
Estimating the quality of a partition allows one to
discriminate among the large number of partitions of a graph. In
some cases this is not difficult. For instance, in the bench- 
mark by Girvan and Newman there is a single meaningful 
partition, and it is hard to argue with that. But most 
graphs of the real world have a hierarchical structure, 
with communities including smaller communities and so 
on. Hence there are several meaningful partitions, cor- 
responding to different hierarchical levels, and discrimi- 
nating among them is hard, as they may be all relevant, 
in a sense. If we consider the human body, we cannot 
say that the organization in tissues of the cells is more 
important than the organization in organs. We have seen 
that there are recent methods dealing with the problem of 
finding meaningful hierarchical levels (Section XII). Such 
methods rank the hierarchical partitions based on some 
criterion and one can assess their relevance through the 
ranking. One may wonder whether it makes sense sorting 
out levels, which means introducing a kind of threshold 
on the quality index chosen to rank partitions (to dis- 
tinguish "good" from "bad" partitions), or whether it 
is more appropriate to keep the information given by the 
whole set of hierarchical partitions. The work by Clauset
et al. on hierarchical random graphs (Clauset et al.,
2007; Clauset et al., 2008), discussed in Section XII.B,
indirectly raises this issue. There it was shown that the 
ensemble of model graphs, represented by dendrograms, 
encodes most of the information on the structure of the 
graph under study, like its degree distribution, transitivity
and distribution of shortest path lengths. At the same 
time, by construction, the model reveals the whole hier- 
archy of communities, without any distinction between 
good and bad partitions. The information given by a 
dendrogram may become redundant and confusing when 
the graph is large, as then there is a large number of
partitions. This is actually the reason why quality functions
were originally introduced. However, in that case, one 
was dealing with artificial hierarchies, produced by tech- 
niques that systematically yield a dendrogram as a result 
of the analysis (like, e. g., hierarchical clustering), re- 
gardless of whether the graph actually has a hierarchical 
structure or not. Here instead we speak of real hierarchy, 
which is a fundamental element of real graphs and, as 
such, it must be considered in any serious approach to 
graph clustering. Any good clustering method must be 
able to tell whether a graph has community structure or 
not, and, in the first case, whether the community
structure is hierarchical (i. e. with two or more levels) or flat
(one level). We expect that the concept of hierarchy will
become a key ingredient of future clustering techniques.
In particular, assessing the consistency of the concepts of
partitions' quality and hierarchical structure is a major
challenge.

A precise definition of null models, i. e. of graphs with- 
out community structure, is also missing. This aspect is 
extremely important, though, as defining communities 
also implies deciding whether or not they exist in a
specific graph. At the moment, it is generally accepted that
random graphs have no communities. The null model
of modularity (Section III.C.2), by far the most
popular, comprises all graphs with the same expected degree
sequence as the original graph and random rewiring of
edges. This class of graphs is characterized, by
construction, by the fact that any vertex can be linked to any
other, as long as the constraint on the degree sequence 
is satisfied. But this is by no means the only possibility. 
A community can be generically defined as a subgraph 
whose vertices have a higher probability to be connected 
to the other vertices of the subgraph than to external 
vertices. The planted ℓ-partition model (Condon and
Karp, 2001) is based on this principle, as we have seen.
However, this does not mean that the linking
probabilities of a vertex with respect to the other vertices in its
community or in different communities must be constant (or
simply proportional to their degrees, as in the
configuration model (Luczak, 1992; Molloy and Reed, 1995)). In
fact, in large networks it is reasonable to assume that the
probability that a vertex is linked to most vertices is zero,
as the vertex "ignores" their existence. This does not
exclude that the probability that the vertex gets connected
to the "known" vertices is the same (or proportional to
their degrees), in which case the graph would still be
random and have no communities. We believe that we are
still far from a precise definition and a complete clas- 
sification of null models. This represents an important 
research line for the future of the field, for three main
reasons: 1) to better disentangle "true" communities from
byproducts of random fluctuations; 2) to pose a stringent
test to existing and future clustering algorithms, whose
reliability would be questionable if they found "false pos- 
itives" in null model graphs; 3) to handle "hybrid" sce- 
narios, where a graph displays community structure on 
some portions of it, while the rest is essentially random 
and has no communities. 
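
To make the role of the null model concrete, the sketch below evaluates modularity for a given partition, written as a sum over clusters of l_c/m − (d_c/2m)^2, where m is the total number of edges, l_c the number of edges inside cluster c and d_c the total degree of its vertices. The second term is the expected fraction of internal edges when edges are rewired at random with the degrees preserved in expectation. The code assumes a simple undirected, unweighted graph given as an edge list; names and toy data are illustrative.

```python
def modularity(edges, membership):
    """Modularity of the partition `membership` (vertex -> cluster label)
    for an undirected, unweighted graph given as a list of edges."""
    m = len(edges)
    internal = {}      # l_c: number of edges inside each cluster
    degree_sum = {}    # d_c: total degree of the vertices of each cluster
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2.0 * m)) ** 2
               for c, d in degree_sum.items())

# Toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}
print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

Placing all vertices in a single cluster gives zero, since observed and expected fractions of internal edges coincide; a different null model would change the subtracted term and hence the partitions that score well.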

In the previous chapters we have seen a great number 
of clustering techniques. Which one(s) shall we use? At 
the moment the scientific community is unable to tell. 
Modularity optimization is probably the most popular 
method, but the results of the analysis of large graphs 
are likely to be unreliable (Section VI.C). Nevertheless,
people have become accustomed to using it, and there have
been several attempts to improve the measure. A
newcomer, who wishes to find clusters in a given network
and is not familiar with clustering techniques, would not 
know, off-hand, which method to use, and he/she would 
hardly find indications about good methods in any single 
paper on graph clustering, except perhaps on the method 
presented in the paper. So, people keep using algorithms 
because they have heard of them, or because they know 
that other people are using them, or because of the
reputation of the scientists who designed them. Pending
future reliable benchmarks, that may give an
objective assessment of the quality of the algorithms, there
are at the moment hardly any solid reasons to prefer one
algorithm to another: the comparative analyses by Danon et
al. (Danon et al., 2005) and by Lancichinetti and
Fortunato (Lancichinetti and Fortunato, 2009) (Section XV.C)
represent a first serious assessment of this issue. However,
we want to stress that there is no such thing as the perfect
method, so it is pointless to look for it. Among other
things, if one tries to look for a very general method, that
should give good results on any type of graph, one is
inevitably forced to make very general assumptions on the
structure of the graph and on the properties of communi- 
ties. In this way one neglects a lot of specific features of 
the system, that may lead to a more accurate detection 
of the clusters. Informing a method with features charac- 
terizing some types of graphs makes it far more reliable 
to detect the community structure of those graphs than a 
general method, even if its applicability may be limited. 
Therefore in the future we envision the development of 
domain-specific clustering techniques. The challenge here 
is to identify the peculiar features of classes of graphs, 
which are bound to become crucial ingredients in the de- 
sign of suitable algorithms. Some of the methods avail- 
able today are actually based on assumptions that hold 
only for some specific categories of graphs. The Clique 
Percolation Method by Palla et al. (Palla et al., 2005),
for instance, may work well for graphs characterized by 
a large number of cliques, like certain social networks, 
whereas it may give poor results otherwise. 

Moving one step further, one should learn how to use 
specific information about a graph, whenever available, 
e. g. properties of vertices and/or partial information 
about their classification. For instance, it may be that 
one has some information on a subset of vertices, like 
demographic data on people of a social network, and such 
data may highlight relationships between people that are 
not obvious from the network of social interactions. In 
this case, using only the social network may be reductive 
and ideally one should exploit both the structural and 
the non-structural information in the search of clusters, 
as the latter should be consistent with both inputs. How 
to do this is an open problem. The scientific community 
has just begun to study this aspect (Allahverdyan and
Galstyan, 2009).

Most algorithms in the literature deal with the "clas- 
sical" case of a graph with undirected and unweighted 
edges. This is certainly the simplest case one could think 
of, and graph clustering is already a complex task on such 
types of graphs. We know that real networks may be 
directed, have weighted connections, be bipartite. Meth- 
ods to deal with such systems have been developed, as 
we have seen, especially in the most recent literature, 
but they are mostly preliminary attempts and there is 
room for improvement. Another situation that may oc- 
cur in real systems is the presence of edges with posi- 
tive and negative weights, indicating attractive and re- 
pulsive interactions, respectively. This is the case, for 
instance, of correlation data (Mantegna, 1999). In this 
case, ideal partitions would have positively weighted intr- 
acluster edges and negatively weighted intercluster edges. 
We have discussed some studies in this direction (Gomez 
et al., 2009; Kaplan and Forrest, 2008; Traag and Bruggeman, 
2009), but we are just at the beginning of this endeavour. 
Instead, there are no algorithms yet which are capable of 
dealing with graphs in which there are edges of 
several types, indicating different kinds of interactions 
between the vertices (multigraphs). Agents of social networks, 
for instance, may be joined by working relationships, 
friendship, family ties, etc. At the moment there 
are essentially two ways of proceeding in these instances: 
1) keeping edges of one type and forgetting the others, 
repeating the analysis for each type of edges and eventu- 
ally comparing the results obtained; 2) analyzing a single 
(weighted) graph, obtained by "combining" the contribu- 
tions of the different types of edges in some way. Finally, 
since most real networks are built through the results 
of experiments, which carry errors in their estimates, it 
would be useful to consider as well the case in which edges 
have not only associated weights, but also errors on their 
values. 
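
The second option above, merging several edge types into one weighted graph, can be sketched as follows. This is only a minimal illustration under our own assumptions: the function name `combine_layers`, the layer names and the per-type weights are hypothetical, and a weighted sum is just one of many possible ways of "combining" the contributions.

```python
from collections import defaultdict

def combine_layers(layers, layer_weights):
    """Merge edge lists of different types into a single weighted graph.

    layers: dict mapping an edge type to a list of undirected edges (u, v).
    layer_weights: dict mapping an edge type to its (assumed) importance.
    Returns a dict mapping each undirected edge to its combined weight.
    """
    combined = defaultdict(float)
    for edge_type, edges in layers.items():
        for u, v in edges:
            key = tuple(sorted((u, v)))  # undirected: (u, v) and (v, u) coincide
            combined[key] += layer_weights[edge_type]
    return dict(combined)

# Hypothetical example: two kinds of relationships between three people.
layers = {"work": [("a", "b"), ("b", "c")], "family": [("b", "a")]}
weights = {"work": 1.0, "family": 2.0}
graph = combine_layers(layers, weights)
# graph == {("a", "b"): 3.0, ("b", "c"): 1.0}
```

The resulting weighted graph can then be passed to any method that handles weighted edges; the choice of the weights is arbitrary here and would have to be motivated by the system under study.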

Since the paper by Palla et al. (Palla et al., 2005), 
overlapping communities have received a lot of attention 
(Section XI). However, there is still no consensus about 
a quantitative definition of the concept of overlapping 
community, and most definitions depend on the method 
adopted. Intuitively, one would expect that clusters share 
vertices lying at their borders, and this idea has inspired 
most algorithms. However, clusters detected with the 
Clique Percolation Method (Section XI.A) often share 
central vertices of the clusters, which makes sense in spe- 
cific instances, especially in social networks. So, it is still 
unclear how to characterize overlapping vertices. More- 
over, the concept of overlapping clusters seems at odds 
with that of hierarchical structure. No dendrogram can 
be drawn if there are overlapping vertices, at least in the 
standard way. Due to the relevance of both features in 
real networks, it is necessary to adapt them to each other 
in a consistent way. Overlapping vertices pose problems 
as well when it comes to comparing the results of different 
methods on the same graph. Most similarity measures 
are defined only in the case of partitions, where each vertex 
is assigned to a single cluster (Section XV.B). It is 
then necessary to extend such definitions to the case of 
overlapping communities, whenever possible. 

Another issue that is becoming increasingly popular 
is the study of graphs evolving in time. This is now possible 
due to the availability of timestamped network data 
sets. Tracking the evolution of community structure in 
time is very important, to uncover how communities are 
generated and how they interact with each other. Scholars 
have just begun to study this problem (Asur et al., 
2007; Chakrabarti et al., 2006; Chi et al., 2007; Fenn 
et al., 2009; Hopcroft et al., 2004; Kim and Han, 2009; 
Lin et al., 2008; Palla et al., 2007; Sun et al., 2007) (Section 
XIII). Typically one analyzes separately snapshots 
at different times and checks what happened at time t + 1 
to the communities at time t. It would probably be better 
to use simultaneously the whole dynamic data set, 
and future work shall aim at defining proper ways to do 
that. In this respect, the evolutionary clustering framework 
by Chakrabarti et al. (Chakrabarti et al., 2006) is 
a promising starting point. 
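
The snapshot-based procedure mentioned above can be illustrated with a small sketch. We assume, purely for illustration, that communities at consecutive times are matched by their Jaccard similarity; this criterion and the names `jaccard` and `match_communities` are our own choices, not a prescription taken from the works cited.

```python
def jaccard(a, b):
    """Jaccard similarity of two vertex sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_communities(prev, curr):
    """For each community at time t, find the most similar one at time t + 1.

    prev, curr: lists of vertex sets (the two snapshots).
    Returns a dict {index in prev: (index in curr, similarity)}.
    """
    matches = {}
    for i, c_prev in enumerate(prev):
        scores = [jaccard(c_prev, c_next) for c_next in curr]
        if scores:
            best = max(range(len(curr)), key=scores.__getitem__)
            matches[i] = (best, scores[best])
        else:
            matches[i] = (None, 0.0)  # nothing to match against
    return matches

# Toy snapshots: one community loses a vertex, the other gains one.
snapshot_t = [{1, 2, 3, 4}, {5, 6, 7}]
snapshot_t1 = [{5, 6, 7, 8}, {1, 2, 3}]
print(match_communities(snapshot_t, snapshot_t1))
# {0: (1, 0.75), 1: (0, 0.75)}
```

A low best similarity may signal that a community has dissolved or split, but where to put the threshold is a modeling decision that this sketch leaves open.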

The computational complexity of graph clustering al- 
gorithms has improved by at least one power in the graph 
size (on average) in just a couple of years. Due to the 
large size of many systems one wishes to investigate, the 
ultimate goal would be to design techniques with lin- 
ear or even sublinear complexity. Nowadays partitions 
in graphs with up to millions of vertices can be found. 
However, the results are not yet very reliable, as they are 
usually obtained by greedy optimizations, which yield 
rough approximations of the desired solution. In this 
respect the situation could improve by focusing on the 
development of efficient local methods, for two reasons: 
1) they enable analyses of portions of the graph, indepen- 
dently of the rest; 2) they are often suitable for parallel 
implementations, which may speed up considerably the 
computation. 

Finally, while there has been a tremendous effort in the design 
of clustering algorithms, basically nothing has been 
done to make sense of their results. What shall we do 
with communities? What can they tell us about a sys- 
tem? The hope is that they will enable one to disclose 
"hidden" relationships between vertices, due to features 
that are not known, because they are hard to measure, 
for instance. It is quite possible that the scientific com- 
munity will converge sooner or later to a definition a 
posteriori of community. Already now, most algorithms 
yield similar results in practical applications. But what 
is the relationship between the vertex classification given 
by the algorithms and real classifications? This is the 
main question beneath the whole endeavor. 

Acknowledgments 

I am indebted to these people for giving useful sugges- 
tions and advice to improve this manuscript at various 
stages: A. Arenas, J. W. Berry, A. Clauset, P. Csermely, 
S. Gomez, S. Gregory, V. Gudkov, R. Guimera, Y. Is- 
polatov, R. Lambiotte, A. Lancichinetti, J. -P. Onnela, 
G. Palla, M. A. Porter, F. Radicchi, J. J. Ramasco, C. 
Wiggins. I gratefully acknowledge ICTeCollective, grant 
number 238597 of the European Commission. 



Appendix A: Elements of Graph Theory 

1. Basic Definitions 

A graph $G$ is a pair of sets $(V, E)$, where $V$ is a set 
of vertices or nodes and $E$ is a subset of $V^2$, the set of 
unordered pairs of elements of $V$. The elements of $E$ are 
called edges or links, the two vertices that identify an edge 
are called endpoints. An edge is adjacent to each of its 
endpoints. If each edge is an ordered pair of vertices one 
has a directed graph (or digraph). In this case an ordered 
pair $(v, w)$ is an edge directed from $v$ to $w$, or an edge 
beginning at $v$ and ending at $w$. A graph is visualized as 
a set of points connected by lines, as shown in Fig. 41. 

FIG. 41 A sample graph with seven vertices and seven edges. 



In many real examples, graphs are weighted, i. e. a real 
number is associated to each of the edges. Graphs do not 
include loops, i. e. edges connecting a vertex to itself, nor 
multiple edges, i. e. several edges joining the same pair of 
vertices. Graphs with loops and multiple edges are called 
multigraphs. Generalizations of graphs admitting edges 
between any number of vertices (not necessarily two) are 
called hypergraphs. 

A graph $G' = (V', E')$ is a subgraph of $G = (V, E)$ if 
$V' \subset V$ and $E' \subset E$. If $G'$ contains all edges of $G$ that join 
vertices of $V'$ one says that the subgraph $G'$ is induced 
or spanned by $V'$. A partition of the vertex set $V$ in two 
subsets $S$ and $V - S$ is called a cut; the cut size is the 
number of edges of $G$ joining vertices of $S$ with vertices 
of $V - S$. 
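
The notions of induced subgraph and cut size translate directly into code. The following is a minimal sketch, assuming the graph is given as a list of undirected edges; the function names are ours.

```python
def induced_edges(edges, vertices):
    """Edges of the subgraph induced by the given vertex subset."""
    vertices = set(vertices)
    return [(u, v) for u, v in edges if u in vertices and v in vertices]

def cut_size(edges, s):
    """Number of edges joining a vertex of S with a vertex of V - S."""
    s = set(s)
    return sum(1 for u, v in edges if (u in s) != (v in s))

# Toy graph: a path 1-2-3-4 plus the extra edge 1-3.
edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
print(induced_edges(edges, {1, 2, 3}))  # [(1, 2), (2, 3), (1, 3)]
print(cut_size(edges, {1, 2}))          # 2, i.e. the edges 2-3 and 1-3
```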

We indicate the number of vertices and edges of a 
graph with $n$ and $m$, respectively. The number of vertices 
is the order of the graph, the number of edges its size. 
The maximum size of a graph equals the total number of 
unordered pairs of vertices, $n(n-1)/2$. If $|V| = n$ and 
$|E| = m = n(n-1)/2$, the graph is a clique (or complete 
graph), and is indicated as $K_n$. Two vertices are neighbors 
(or adjacent) if they are connected by an edge. The 
set of neighbors of a vertex $v$ is called neighborhood, and 
we shall denote it with $\Gamma(v)$. The degree $k_v$ of a vertex $v$ 
is the number of its neighbors. The degree sequence is the 
list of the degrees of the graph vertices, $k_{v_1}, k_{v_2}, \ldots, k_{v_n}$. 
On directed graphs, one distinguishes two types of degree 
for a vertex $v$: the indegree, i. e. the number of edges 
ending at $v$, and the outdegree, i. e. the number of edges 
beginning at $v$. The analogue of degree on a weighted graph 
is the strength, i. e. the sum of the weights of the edges 
adjacent to the vertex. Another useful local property of 
graphs is transitivity or clustering (Watts and Strogatz, 
1998), which indicates the level of cohesion between the 
neighbors of a vertex. The clustering coefficient $c_v$ of 
vertex $v$ is the ratio between the number of edges joining 
pairs of neighbors of $v$ and the total number of possible 
edges, given by $k_v(k_v - 1)/2$, $k_v$ being the degree of $v$. 
According to this definition, $c_v$ measures the probability 
that a pair of neighbors of $v$ are connected. Since 
all neighbors of $v$ are connected to $v$ by definition, edges 
connecting pairs of neighbors of $v$ form triangles with 
$v$. This is why the definition is often given in terms of 
number of triangles. 
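
As an illustration of the definitions of degree and clustering coefficient, here is a minimal Python sketch. The helper names are ours, and setting the coefficient to zero for vertices with fewer than two neighbors is a convention we assume, since the ratio is undefined in that case.

```python
from itertools import combinations

def build_adjacency(edges):
    """Map each vertex to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)  # degree of v
    if k < 2:
        return 0.0  # assumed convention: no pairs of neighbors to test
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

# Triangle 1-2-3 with a pendant vertex 4 attached to vertex 3.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4)])
print(len(adj[3]))                     # degree of vertex 3: 3
print(clustering_coefficient(adj, 1))  # 1.0
print(clustering_coefficient(adj, 3))  # one link out of three pairs: 1/3
```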

A path is a graph $P = (V(P), E(P))$, with $V(P) = 
\{x_0, x_1, \ldots, x_l\}$ and $E(P) = \{x_0x_1, x_1x_2, \ldots, x_{l-1}x_l\}$. The 
vertices $x_0$ and $x_l$ are the endvertices of $P$, whereas $l$ 
is its length. Given the notions of vertices, edges and 
paths, one can define the concept of independence. A set 
of vertices (or edges) of a graph is independent if no two 
of its elements are adjacent. Similarly, two paths are 
independent if they only share the endvertices. A cycle 
is a closed path whose vertices and edges are all distinct. 
Cycles of length $l$ are indicated with $C_l$. The smallest 
non-trivial cycle is the triangle, $C_3$. 

Paths allow one to define the concepts of connectivity and 
distance in graphs. A graph is connected if, given any 
pair of vertices, there is at least one path going from 
one vertex to the other. In general, there may be multiple 
paths connecting two vertices, with different lengths. 
A shortest path, or geodesic, between two vertices of a 
graph, is a path of minimal length. Such minimal length 
is the distance between the two vertices. The diameter 
of a connected graph is the maximal distance between 
two vertices. If there is no path between two vertices, 
the graph is divided in at least two connected subgraphs. 
Each maximal connected subgraph of a graph is called 
a connected component. 
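
Distances, connected components and the diameter of an unweighted graph can all be obtained with breadth-first search. The sketch below is a minimal illustration, assuming the graph is stored as a dictionary mapping each vertex to the set of its neighbors; the function names are ours.

```python
from collections import deque

def bfs_distances(adj, source):
    """Distance (shortest-path length) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def connected_components(adj):
    """List of maximal connected vertex sets."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            component = set(bfs_distances(adj, v))
            seen |= component
            components.append(component)
    return components

def diameter(adj):
    """Maximal distance between two vertices; assumes a connected graph."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

# A path 1-2-3 and a separate edge 4-5.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
print(bfs_distances(adj, 1))      # {1: 0, 2: 1, 3: 2}
print(connected_components(adj))  # [{1, 2, 3}, {4, 5}]
print(diameter({1: {2}, 2: {1, 3}, 3: {2}}))  # 2
```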

A graph without cycles is a forest. A connected forest 
is a tree. Trees are very important in graph theory and 
deserve some attention. In a tree, there can be only one 
path from a vertex to any other. In fact, if there were 
at least two paths between the same pair of vertices they 
would form a cycle, while the tree is an acyclic graph by 
definition. Further, the number of edges of a tree with 
$n$ vertices is $n - 1$. If any edge of a tree is removed, 
the tree gets disconnected in two parts; if a new edge is 
added, there is at least one cycle. This is why a 
tree is a minimally connected, maximally acyclic graph 
of a given order. Every connected graph contains a spanning 
tree, i. e. a tree sharing all vertices of the graph. 
On weighted graphs, one can define a minimum (maxi- 
mum) spanning tree, i. e. a spanning tree such that the 
sum of the weights on the edges is minimal (maximal). 
Minimum and maximum spanning trees are often used in 
graph optimization problems, including clustering. 
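
A minimum spanning tree can be computed, for instance, with Kruskal's algorithm, which the text above does not discuss and which we use here only as a familiar example. The sketch assumes vertices labeled 0, ..., n - 1 and edges given as (weight, u, v) triples; the function name is ours.

```python
def minimum_spanning_tree(n, weighted_edges):
    """Kruskal's algorithm on a connected weighted graph with n vertices.

    weighted_edges: iterable of (weight, u, v) with 0 <= u, v < n.
    Returns the list of tree edges as (u, v, weight).
    """
    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(weighted_edges):  # lightest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components: no cycle
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree

edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (1.5, 2, 3), (4.0, 1, 3)]
tree = minimum_spanning_tree(4, edges)
print(tree)       # [(0, 1, 1.0), (2, 3, 1.5), (1, 2, 2.0)]
print(len(tree))  # 3, i.e. n - 1 edges
```

Sorting the edges in decreasing order of weight instead yields a maximum spanning tree.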

A graph $G$ is bipartite if the vertex set $V$ is separated in 
two disjoint subsets $V_1$ and $V_2$, or classes, and every edge 
joins a vertex of $V_1$ with a vertex of $V_2$. The definition 
can be extended to that of $r$-partition, where the vertex 
classes are $r$ and no edge joins vertices within the same 
class. In this case one speaks of multipartite graphs. 

The term clustering is commonly adopted to indicate community 
detection in some disciplines, like computer science, and we 
often used it in this context throughout the manuscript. We 
paid attention to disambiguate the occurrences in which clustering 
indicates instead the local property of a vertex neighborhood 
described above (the clustering coefficient). 



2. Graph Matrices 

The whole information about the topology of a graph 
of order $n$ is contained in the adjacency matrix $A$, which 
is an $n \times n$ matrix whose element $A_{ij}$ equals 1 if there 
is an edge joining vertices $i$ and $j$, otherwise it is zero. 
Due to the absence of loops the diagonal elements of the 
adjacency matrix are all zero. For an undirected graph 
$A$ is a symmetric matrix. The sum of the elements of 
the $i$-th row or column yields the degree of node $i$. If 
the edges are weighted, one defines the weight matrix 
$W$, whose element $W_{ij}$ expresses the weight of the edge 
between vertices $i$ and $j$. 

The spectrum of a graph $G$ is the set of eigenvalues 
of its adjacency matrix $A$. Spectral properties of graph 
matrices play an important role in the study of graphs. 
For instance, the stochastic matrices rule the process of 
diffusion (random walk) on a graph. The right stochastic 
matrix $R$ is obtained from $A$ by dividing the elements of 
each row $i$ by the degree of vertex $i$. The left stochastic 
matrix $T$, or transfer matrix, is the transpose of $R$. 
The spectra of stochastic matrices allow one to evaluate, for 
instance, the mixing time of the random walk, i. e. the 
time it takes to reach the stationary distribution of the 
process. The latter is obtained by computing the eigenvector 
of the transfer matrix corresponding to the largest 
eigenvalue. 
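
The construction of the stochastic matrices can be sketched in a few lines of plain Python. For a connected undirected graph the stationary distribution of the random walk is known to be proportional to the vertex degree, $k_i / 2m$, and the example checks that this vector is left unchanged by the transfer matrix. The function names are ours, and the sketch assumes a graph without isolated vertices.

```python
def adjacency_matrix(n, edges):
    """Symmetric 0/1 matrix of an undirected graph on vertices 0..n-1."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a

def right_stochastic(a):
    """R: each row of A divided by the degree of the corresponding vertex."""
    return [[x / sum(row) for x in row] for row in a]  # assumes degree > 0

def transpose(m):
    return [list(column) for column in zip(*m)]

def apply(m, vec):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, vec)) for row in m]

# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
degrees = [sum(row) for row in A]                    # [2, 2, 3, 1]
T = transpose(right_stochastic(A))                   # transfer matrix
pi = [k / (2 * len(edges)) for k in degrees]         # k_i / 2m
print(pi)            # [0.25, 0.25, 0.375, 0.125]
print(apply(T, pi))  # equals pi up to rounding: eigenvector with eigenvalue 1
```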

Another important matrix is the Laplacian $L = D - A$, 
where $D$ is the diagonal matrix whose element $D_{ii}$ equals 
the degree of vertex $i$. The matrix $L$ is usually referred 
to as unnormalized Laplacian. In the literature one often 
uses normalized Laplacians (Chung, 1997), of which 
there are two main forms: $L_{sym} = D^{-1/2} L D^{-1/2}$ and 
$L_{rw} = D^{-1} L = I - D^{-1} A = I - R$. The matrix $L_{sym}$ 
is symmetric; $L_{rw}$ is not symmetric and is closely related 
to a random walk taking place on the graph. All 
Laplacian matrices have a straightforward extension to 
the case of weighted graphs. The Laplacian is one of the 
most studied matrices and finds application in many different 
contexts, like graph connectivity (Bollobas, 1998), 
synchronization (Barahona and Pecora, 2002; Nishikawa 
et al., 2003), diffusion (Chung, 1997) and graph partitioning 
(Pothen, 1997). By construction, the sum of the 
elements of each row of the Laplacian (normalized or unnormalized) 
is zero. This implies that $L$ always has at 
least one zero eigenvalue, corresponding to the eigenvector 
with all equal components, such as $(1, 1, \ldots, 1)$. Eigenvectors 
corresponding to different eigenvalues are all orthogonal 
to each other. Interestingly, $L$ has as many 
zero eigenvalues as there are connected components in the 
graph. So, the Laplacian of a connected graph has but 
one zero eigenvalue, all others being positive. Eigenvectors 
of Laplacian matrices are regularly used in spectral 
clustering (Section IV.D). In particular, the eigenvector 
corresponding to the second smallest eigenvalue, called 
Fiedler vector (Fiedler, 1973, 1975), is used for graph bipartitioning, 
as described in Section IV.A. 
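
A small worked example may help to fix these properties. We take, as our own illustration, the path graph with three vertices and edges 1-2 and 2-3, and write its unnormalized Laplacian together with its eigenvalues and (unnormalized) eigenvectors.

```latex
% Path graph on three vertices, with edges 1-2 and 2-3
\[
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
L = D - A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}
\]
\[
\lambda_1 = 0,\ \mathbf{v}_1 = (1, 1, 1); \qquad
\lambda_2 = 1,\ \mathbf{v}_2 = (1, 0, -1); \qquad
\lambda_3 = 3,\ \mathbf{v}_3 = (1, -2, 1)
\]
```

Each row of $L$ sums to zero, there is a single zero eigenvalue because the graph is connected, and the three eigenvectors are mutually orthogonal. The Fiedler vector $\mathbf{v}_2$ has opposite signs on the two end vertices, so it separates the two ends of the path, with the central vertex sitting on the boundary.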



3. Model graphs 

In this section we present the most popular models of 
graphs introduced to describe real systems, at least to 
some extent. Such graphs are useful null models in com- 
munity detection, as they do not have community struc- 
ture, so they can be used for negative tests of clustering 
algorithms. 

The oldest model is that of the random graph, proposed 
by Solomonoff and Rapoport (Solomonoff and Rapoport, 
1951) and independently by Erdos and Renyi (Erdos and 
Renyi, 1959). There are two parameters: the number of 
vertices $n$ and the connection probability $p$. Each pair 
of vertices is connected with equal probability $p$ independently 
of the other pairs. The expected number of edges 
of the graph is $pn(n-1)/2$, and the expected mean degree 
is $\langle k \rangle = p(n-1)$. The degree distribution of the vertices 
of a random graph is binomial, and in the limit $n \to \infty$, 
$p \to 0$ for fixed $\langle k \rangle$ it converges to a Poissonian. Therefore, 
the vertices have all about the same degree, close 
to $\langle k \rangle$ (Fig. 42, top). The most striking property of this 
class of graphs is the phase transition observed by varying 
$\langle k \rangle$ in the limit $n \to \infty$. For $\langle k \rangle < 1$, the graph is 
separated in connected components, each of them being 
microscopic, i. e. occupying but a vanishing portion of 
the system size. For $\langle k \rangle > 1$, instead, one of the components 
becomes macroscopic (giant component), i. e. it 
occupies a finite fraction of the graph vertices. 
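
The definition of the random graph translates almost literally into code. The following minimal sketch, with function names of our choosing, draws each of the $n(n-1)/2$ pairs independently with probability $p$ and measures the mean degree, which should fluctuate around $p(n-1)$.

```python
import random
from itertools import combinations

def random_graph(n, p, seed=None):
    """Edge list of a random graph: each pair is linked with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

def mean_degree(n, edges):
    """Average degree: every edge contributes to two vertices."""
    return 2 * len(edges) / n

# Same parameters as the top panel of Fig. 42; the seed is arbitrary.
edges = random_graph(100, 0.02, seed=0)
print(mean_degree(100, edges))  # fluctuates around p(n - 1) = 1.98

# Limiting cases: p = 1 gives the complete graph, p = 0 the empty graph.
print(len(random_graph(5, 1.0)))  # 10 = 5 * 4 / 2
print(len(random_graph(5, 0.0)))  # 0
```

The quadratic loop over all pairs is adequate for small graphs only; faster generators exist for sparse graphs but are beyond the scope of this sketch.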

The diameter of a random graph with n vertices is very 
small, growing only logarithmically with n. This prop- 
erty (small-world effect) is very common in many real 
graphs. The first evidence that social networks are characterized 
by paths of small length was provided by a series 
of famous experiments conducted by the psychologist 
Stanley Milgram (Milgram, 1967; Travers and Milgram, 
1969). The expected clustering coefficient of a vertex of a 
random graph is p, as the probability for two vertices to 
be connected is the same whether they are neighbors of 
the same vertex or not. Real graphs, however, are char- 
acterized by far higher values of the clustering coefficient 
as compared to random graphs of the same size. Watts 
and Strogatz (Watts and Strogatz, 1998) showed that the 
small world property and high clustering coefficient can 
coexist in the same system. They designed a class of 
graphs which result from an interpolation between a reg- 
ular lattice, which has high clustering coefficient, and a 
random graph, which has the small-world property. One 
starts from a ring lattice in which each vertex has degree 
$k$, and with a probability $p$ each edge is rewired to a different 
target vertex (Fig. 42, center). It turns out that 
low values of $p$ suffice to reduce considerably the length 
of shortest paths between vertices, because rewired edges 
act as shortcuts between initially remote regions of the 
graph. On the other hand, the clustering coefficient remains 
high, since few rewired edges do not perturb appreciably 
the local structure of the graph, which remains 
similar to the original ring lattice. For $p = 1$ all edges are 
rewired and the resulting structure is a random graph à 
la Erdos and Renyi. 

FIG. 42 Basic models of complex networks. (Top) Erdos-Renyi 
random graph with 100 vertices and a link probability 
$p = 0.02$. (Center) Small world graph à la Watts-Strogatz, 
with 100 vertices and a rewiring probability $p = 0.1$. (Bottom) 
Barabasi-Albert scale-free network, with 100 vertices 
and an average degree of 2. Courtesy of J. J. Ramasco. 
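
The rewiring procedure can be sketched as follows. This is a simplified illustration under our own assumptions: the degree $k$ is even and smaller than $n$, one endpoint of each lattice edge is kept fixed, and the new target is chosen uniformly among the vertices that would create neither a loop nor a multiple edge. The function name is ours.

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Ring lattice with n vertices of even degree k < n, rewired with prob. p.

    Returns a dict mapping each vertex to the set of its neighbors.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Ring lattice: each vertex is joined to its k/2 nearest neighbors per side.
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    # Visit each lattice edge (v, v + step) once and rewire it with prob. p.
    for step in range(1, k // 2 + 1):
        for v in range(n):
            w = (v + step) % n
            if w in adj[v] and rng.random() < p:
                candidates = [u for u in range(n) if u != v and u not in adj[v]]
                if candidates:  # skip if v is already joined to everybody
                    new = rng.choice(candidates)
                    adj[v].remove(w)
                    adj[w].remove(v)
                    adj[v].add(new)
                    adj[new].add(v)
    return adj

lattice = watts_strogatz(10, 4, 0.0)
print(sorted(lattice[0]))  # [1, 2, 8, 9]: the original ring lattice
rewired = watts_strogatz(100, 4, 0.1, seed=1)
print(sum(len(nb) for nb in rewired.values()) // 2)  # 200: edges are conserved
```

Since every rewiring removes one edge and adds one, the number of edges, and hence the mean degree, is the same for all values of $p$.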

The seminal paper of Watts and Strogatz triggered a 
huge interest in the graph representation of real 
systems. One of the most important discoveries was that 
the distribution of the vertex degree of real graphs is very 
heterogeneous (Albert et al., 1999), with many vertices 
having few neighbors coexisting with some vertices with 
many neighbors. In several cases the tail of this distribution 
can be described as a power law with good approximation, 
hence the expression scale-free networks. 
Such degree heterogeneity is responsible for a number of 
remarkable features of real networks, such as resilience to 
random failures/attacks (Albert et al., 2000), and the absence 
of a threshold for percolation (Cohen et al., 2000) 
and epidemic spreading (Pastor-Satorras and Vespignani, 
2001). The most popular model of a graph with a power 
law degree distribution is the model by Barabasi and Albert 
(Barabasi and Albert, 1999). A version of the model 
for directed graphs had been proposed much earlier by de 
Solla Price (Price, 1976), building up on previous ideas 
developed by Simon (Simon, 1955). The graph is created 
with a dynamic procedure, where vertices are added one 
by one to an initial core. The probability for a new vertex 
to set an edge with a preexisting vertex is proportional 
to the degree of the latter. In this way, vertices with high 
degree have large probability of being selected as neigh- 
bors by new vertices; if this happens, their degree further 
increases so they will be even more likely to be chosen in 
the future. In the asymptotic limit of infinite number of 
vertices, this rich-gets-richer strategy generates a graph 
with a degree distribution characterized by a power-law 
tail with exponent 3. In Fig. 42 (bottom) we show an 
example of Barabasi-Albert (BA) graph. The cluster- 
ing coefficient of a BA graph decays with the size of the 
graph, and it is much lower than in real networks. Moreover, 
the power law decays of the degree distributions 
observed in real networks are characterized by a range 
of exponent values (usually between 2 and 3), whereas 
the BA model yields a fixed value. However, many refinements 
of the BA model as well as plenty of different 
models have been later introduced to account more 
closely for the features observed in real systems (for details 
see (Albert and Barabasi, 2002; Barrat et al., 2008; 
Boccaletti et al., 2006; Mendes and Dorogovtsev, 2003; 
Newman, 2003; Pastor-Satorras and Vespignani, 2004)). 

The power law is however not necessary to explain the properties 
of complex networks. It is enough that the tails of the degree 
distributions are "fat", i. e. spanning orders of magnitude in 
degree. They may or may not be accurately fitted by a power 
law. 
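
The growth procedure of the Barabasi-Albert model can be sketched in a few lines. The details below are our own assumptions rather than a specification taken from the original paper: the initial core is a complete graph on $m + 1$ vertices, each new vertex sets $m$ edges to distinct targets, and the function name is ours.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a graph to n vertices; each new vertex sets m preferential edges.

    Returns the edge list. Assumes n > m >= 1.
    """
    rng = random.Random(seed)
    # Initial core (assumption): a complete graph on m + 1 vertices.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Every vertex appears in `stubs` once per edge endpoint, so a uniform
    # draw from this list picks a vertex with probability proportional
    # to its degree.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # m distinct pre-existing vertices
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((t, new))
            stubs.extend((t, new))
    return edges

edges = barabasi_albert(100, 1, seed=0)
print(len(edges))            # 99 edges
print(2 * len(edges) / 100)  # 1.98, an average degree close to 2 as in Fig. 42
```

Drawing from the list of edge endpoints is a common way to implement linear preferential attachment without computing the degrees explicitly.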